Fast Moment Estimation in Data Streams in Optimal Space
Abstract

We give a space-optimal algorithm with update time $O(\log^2(1/\varepsilon)\log\log(1/\varepsilon))$ for
$(1 \pm \varepsilon)$-approximating the $p$th frequency moment, $0 < p < 2$, of a length-$n$ vector updated
in a data stream. This provides a nearly exponential improvement in the update time complexity over the
previous space-optimal algorithm of [Kane-Nelson-Woodruff, SODA 2010], which had update
time $\Omega(1/\varepsilon^2)$.



1 Introduction

The problem of estimating frequency moments of a vector being updated in a data stream was first
studied by Alon, Matias, and Szegedy [3] and has since received much attention [5, 6, 21, 22, 26, 28,
30, 40, 41]. Estimation of the second moment has applications to estimating join and self-join sizes
[2] and to network anomaly detection [27, 37]. First moment estimation is useful in mining network
traffic data [11], comparing empirical probability distributions, and several other applications (see
[30] and the references therein). Estimating fractional moments between the 0th and 2nd has
applications to entropy estimation for the purpose of network anomaly detection [20, 42], mining
tabular data [9], image decomposition [17], and weighted sampling in turnstile streams [29]. It
was also observed experimentally that the use of fractional moments in (0, 1) can improve the
effectiveness of standard clustering algorithms [1].

Formally in this problem, we are given up front a real number $p > 0$. There is also an underlying
$n$-dimensional vector $x$ which starts as $\vec{0}$. What follows is a sequence of $m$ updates of the form
$(i_1, v_1), \ldots, (i_m, v_m) \in [n] \times \{-M, \ldots, M\}$ for some $M > 0$. An update $(i, v)$ causes the change
$x_i \leftarrow x_i + v$. We would like to compute $F_p = \|x\|_p^p = \sum_{i=1}^n |x_i|^p$, also called the $p$th frequency
moment of $x$. In many applications, it is required that the algorithm only use very limited space
while processing the stream, e.g., in networking applications where $x$ may be indexed by
source-destination IP pairs and thus a router cannot afford to store the entire vector in memory, or in
database applications where one wants a succinct "sketch" of some dataset, which can be compared
with short sketches of other datasets for fast computation of various (dis)similarity measures.
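
As a point of reference for the problem statement above, the following minimal Python sketch maintains $x$ explicitly under the updates $(i, v)$ and computes $F_p$ exactly. It uses space linear in the number of nonzero coordinates, which is exactly what a streaming algorithm must avoid. The class and method names are illustrative and do not come from the paper.

```python
from collections import defaultdict


class ExactMoment:
    """Store x explicitly and compute F_p = sum_i |x_i|^p exactly (baseline only)."""

    def __init__(self):
        self.x = defaultdict(int)  # coordinate index -> current integer value

    def update(self, i, v):
        """Apply the stream update (i, v), i.e. x_i <- x_i + v."""
        self.x[i] += v
        if self.x[i] == 0:
            del self.x[i]  # keep only nonzero coordinates

    def moment(self, p):
        """Return the p-th frequency moment of the current vector, for p > 0."""
        return sum(abs(value) ** p for value in self.x.values())


# A short stream with positive and negative updates; coordinate 7 cancels out.
stream = [(1, 3), (2, -4), (1, -1), (7, 5), (7, -5)]
sketch = ExactMoment()
for i, v in stream:
    sketch.update(i, v)
print(sketch.moment(1), sketch.moment(2))
```

After the stream above the vector has $x_1 = 2$ and $x_2 = -4$, so the first and second moments are 6 and 20.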

Figure 1: Comparison of our contribution to previous works on $F_p$-estimation in data streams. All
space bounds hide an additive $O(\log\log n)$ term.

Unfortunately, it is known that linear space ($\Omega(\min\{n, m\})$ bits) is required unless one allows for
(a) approximation, so that we are only guaranteed to output a value in $[(1-\varepsilon)F_p, (1+\varepsilon)F_p]$ for some
$0 < \varepsilon < 1/2$, and (b) randomization, so that the output is only guaranteed to be correct with some
probability bounded away from 1, over the randomness used by the algorithm [3]. Furthermore,
it is known that polynomial space is required for $p > 2$ [5, 7, 18, 23, 36], while it is known that
the space complexity for $0 < p \le 2$ is only $O(\varepsilon^{-2}\log(mM) + \log\log(n))$ bits to achieve success
probability 2/3 [3, 26], which can be amplified by outputting the median estimate of independent
repetitions. In this work, we focus on this "feasible" regime for $p$, $0 < p \le 2$, where logarithmic
space is achievable.
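
The amplification step mentioned above (taking the median of independent repetitions) can be sketched as follows. This is a generic illustration of the median trick, not code from the paper; `amplify_by_median` and the hard-coded outputs are hypothetical.

```python
import statistics


def amplify_by_median(run_once, repetitions):
    """Run an estimator `repetitions` times and return the median of the outputs."""
    return statistics.median(run_once() for _ in range(repetitions))


# Hypothetical estimator outputs: most runs land near the true value 10,
# while a minority of runs are far off in either direction.
outputs = iter([100.0, 9.9, 10.1, 10.0, -50.0])
estimate = amplify_by_median(lambda: next(outputs), repetitions=5)
print(estimate)
```

The median is unaffected by the two outlying runs because it only fails when at least half of the repetitions fail, which is what drives the failure probability down as the number of repetitions grows.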

While there has been much previous work on minimizing the space consumption in streaming
algorithms, only recently have researchers begun to work toward minimizing update time [34,
Question 1], i.e., the time taken to process a new update in the stream. We argue however that update
time itself is an important parameter to optimize, and in some scenarios it may even be desirable
to sacrifice space for speed. For example, in network traffic monitoring applications each packet is
an update, and thus it is important that a streaming algorithm processing the packet stream be
able to operate at network speeds (see for example the applications in [27, 37]). Note that if an
algorithm has update time say, $\Omega(1/\varepsilon^2)$, then achieving a small error parameter such as $\varepsilon = .01$
could be intractable since this time is multiplied by the length of the stream. This is true even if
the space required of the algorithm is small enough to fit in memory.

For $p = 2$, it is known that optimal space and $O(1)$ update time are simultaneously achievable
[8, 37], improving upon the original $F_2$ algorithm of Alon, Matias, and Szegedy [3]. For $p = 1$ it is
known that near-optimal, but not quite optimal, space and $O(\log(n/\varepsilon))$ update time are achievable
[30]. Meanwhile, optimal (or even near-optimal) space for other $p \in (0, 2]$ is only known to be
achievable with $\mathrm{poly}(1/\varepsilon)$ update time [26].

Our Contribution: For all $0 < p < 2$ and $0 < \varepsilon < 1/2$ we give an algorithm for
$(1 \pm \varepsilon)$-approximating $F_p$ with success probability at least 2/3 which uses an optimal
$O(\varepsilon^{-2}\log(mM) + \log\log n)$ bits of space with $O(\log^2(1/\varepsilon)\log\log(1/\varepsilon))$ update time.¹ This is a
nearly exponential improvement in the time complexity of the previous space-optimal algorithm for every such $p$.
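
To give a feel for the size of this improvement, the short Python calculation below compares the two growth rates at a few values of $\varepsilon$. It ignores the constants hidden by the asymptotic notation, so the numbers are only indicative; the function names are illustrative.

```python
import math


def previous_update_cost(eps):
    """Growth rate 1/eps^2 of the earlier space-optimal algorithm (constants ignored)."""
    return 1.0 / eps ** 2


def new_update_cost(eps):
    """Growth rate log^2(1/eps) * loglog(1/eps) of the new bound (constants ignored)."""
    log_inv = math.log2(1.0 / eps)
    return log_inv ** 2 * math.log2(log_inv)


for eps in (0.1, 0.01, 0.001):
    print(eps, round(previous_update_cost(eps)), round(new_update_cost(eps), 1))
```

At $\varepsilon = 0.01$ the first quantity is $10^4$ while the second is roughly 120, before constants.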

1.1 Previous Work 

The complexity of streaming algorithms for moment estimation has a long history; see Figure 1 for
a comparison of our result to that of previous work.

Alon, Matias, and Szegedy were the first to study moment estimation in data streams [3] and
gave a space-optimal algorithm for $p = 2$. The update time was later brought down to an optimal
$O(1)$ implicitly in [8] and explicitly in [37]. The work of [13] gave a space-optimal algorithm for
$p = 1$, but under the restriction that each coordinate is updated at most twice, once positively
and once negatively. Indyk [21] later removed this restriction, and also gave an algorithm handling
all $0 < p < 2$, but at the expense of increasing the space by a $\log n$ factor. Li later [28] provided
alternative estimators for all $0 < p < 2$, based on Indyk's sketches. The extra $\log n$ factor in the
space of these algorithms was later removed in [26], yielding optimal space. The algorithms of
[13, 21, 26, 28] all required $\mathrm{poly}(1/\varepsilon)$ update time. Nelson and Woodruff [31] gave an algorithm
for $p = 1$ in the restricted setting where each coordinate is updated at most twice, as in [13], with
space suboptimal by a $\log(1/\varepsilon)$ factor, and with update time $\log(mM)$. They also later gave an
algorithm for $p = 1$ with unrestricted updates which was suboptimal by a $\log n$ factor, but had
update time only $O(\log(n/\varepsilon))$ [30].

¹ Throughout this document we say $g = \tilde{O}(f)$ if $g = O(f \cdot \mathrm{polylog}(f))$. Similarly, $g = \tilde{\Omega}(f)$ if $g = \Omega(f/\mathrm{polylog}(f))$.

On the lower bound front, a lower bound of $\Omega(\min\{n, m, \varepsilon^{-2}\log(\varepsilon^2 mM)\} + \log\log(nmM))$ was
shown in [26], together with an upper bound of $O(\varepsilon^{-2}\log(mM) + \log\log n)$ bits. For nearly the full
range of parameters these are tight, since if $\varepsilon \le 1/\sqrt{m}$ we can store the entire stream in memory
in $O(m\log(nM)) = O(\varepsilon^{-2}\log(nM))$ bits of space (and we can ensure $n = O(m^2)$ via FKS hashing
[14] with just an additive $O(\log\log n)$ bits increase in space), and if $\varepsilon \le 1/\sqrt{n}$ we can store the
entire vector in memory in $O(n\log(mM)) = O(\varepsilon^{-2}\log(mM))$ bits. Thus, a gap exists only when
$\varepsilon$ is very near $1/\sqrt{\min\{n, m\}}$. This lower bound followed many previous lower bounds for this
problem, given in [3, 4, 24, 40, 41]. For the case $p > 2$ it was shown that $\Omega(n^{1-2/p})$ space is
required [5, 7, 18, 23, 36], and this was shown to be tight up to $\mathrm{poly}(\log(nmM)/\varepsilon)$ factors [6, 22].

1.2 Overview of our approach 

At the top level, our algorithm follows the general approach set forth by [30] for $F_1$-estimation. In
that work, the coordinates $i \in \{1, \ldots, n\}$ were split up into heavy hitters, and the remaining light
coordinates. A $\phi$-heavy hitter with respect to $F_p$ is a coordinate $i$ such that $|x_i|^p \ge \phi\|x\|_p^p$. A list
$L$ of $\varepsilon^2$-heavy hitters with respect to $F_1$ was found by running the CountMin sketch of [10].
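
The definition of a $\phi$-heavy hitter can be made concrete with a small offline computation. The Python sketch below works on an explicitly stored vector, so it only illustrates the definition and is not the streaming CountMin-based procedure of [10]; the function names are illustrative and indices are 0-based.

```python
def frequency_moment(x, p):
    """Return F_p = sum_i |x_i|^p for an explicitly stored vector x."""
    return sum(abs(v) ** p for v in x)


def heavy_hitters(x, p, phi):
    """Return all indices i with |x_i|^p >= phi * ||x||_p^p."""
    threshold = phi * frequency_moment(x, p)
    return [i for i, v in enumerate(x) if abs(v) ** p >= threshold]


x = [10, 1, 1, 1, -8]
print(heavy_hitters(x, p=1, phi=0.3))   # threshold is 0.3 * 21 = 6.3
print(heavy_hitters(x, p=2, phi=0.5))   # threshold is 0.5 * 167 = 83.5
```

With $p = 1$ both large coordinates qualify, while with $p = 2$ and $\phi = 0.5$ only the first one does, since squaring concentrates the mass on the largest entry.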

To estimate the contribution of the light elements to $F_1$, [30] used $R = \Theta(1/\varepsilon^2)$ independent
Cauchy sketches $D_1, \ldots, D_R$ (actually, $D_j$ was a tuple of 3 independent Cauchy sketches). A
Cauchy sketch of a vector $x$, introduced by Indyk [21], is the dot product of $x$ with a random vector
$z$ with independent entries distributed according to the Cauchy distribution. This distribution
has the property that $\langle z, x\rangle$ is itself a Cauchy random variable, scaled by $\|x\|_1$. Upon receiving
an update to $x_i$ in the stream, the update was fed to $D_{h(i)}$ for some hash function $h : [n] \to [R]$.
At the end of the stream, the estimate of the contribution to $F_1$ from light elements was
$(R/(R - |h(L)|)) \cdot \sum_{j \notin h(L)} \mathrm{EstLi}_1(D_j)$, where $\mathrm{EstLi}_p$ is Li's geometric mean estimator for $F_p$ [28].
The analysis of [30] only used that Li's geometric mean estimator is unbiased and has a good
variance bound.
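
The Cauchy sketch described above can be illustrated with a few lines of Python. This is a simplified sketch under loud assumptions: it stores the random matrix explicitly with fully independent entries, and it uses Indyk's median estimator rather than Li's geometric mean estimator, because the median of the absolute value of a standard Cauchy variable is exactly 1. The class name and parameters are hypothetical.

```python
import math
import random
import statistics


class CauchySketch:
    """k independent Cauchy dot products <z_j, x>, maintained under updates."""

    def __init__(self, n, k, seed=0):
        rng = random.Random(seed)
        # z[j][i] is a standard Cauchy variable, generated by inverting the CDF.
        self.z = [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
                  for _ in range(k)]
        self.y = [0.0] * k  # y[j] = <z_j, x>

    def update(self, i, v):
        """Process the stream update x_i <- x_i + v."""
        for j, row in enumerate(self.z):
            self.y[j] += row[i] * v

    def estimate_l1(self):
        """Median of |y_j|; each y_j is Cauchy scaled by ||x||_1."""
        return statistics.median(abs(value) for value in self.y)


x = [5, -3, 0, 2, 7, -1, 0, 4]
sketch = CauchySketch(n=len(x), k=2001, seed=1)
for i, v in enumerate(x):
    sketch.update(i, v)
print(sketch.estimate_l1(), sum(abs(v) for v in x))
```

The sketch is linear in $x$, which is why positive and negative updates can be processed in any order. A space-efficient implementation would generate the entries of $z$ pseudorandomly instead of storing them.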

Our algorithm LightEstimator for estimating the contribution to $F_p$ from light coordinates for $p \ne 1$
follows the same approach. Our main contribution here is to show that a variant of Li's geometric
mean estimator has bounded variance and is approximately unbiased (to within relative error $\varepsilon$)
even when the associated $p$-stable random variables are only $k$-wise independent for $k = \Omega(1/\varepsilon^p)$.
This variant allows us to avoid Nisan's pseudorandom generator [32] and thus achieve optimal space.
While the work of [26] also provided an estimator avoiding Nisan's pseudorandom generator, their
estimator is not known to be approximately unbiased, which makes it less useful in applications
involving the average of many such estimators. We evaluate the necessary $k$-wise independent
hash function quickly by a combination of buffering and fast multipoint evaluation of a collection
of pairwise independent polynomials. Our proof that bounded independence suffices uses the
FT-mollification approach introduced in [26] and refined in [12], which is a method for showing that the
expectation of some function is approximately preserved by bounded independence, via a smoothing
operation (FT-mollification) and Taylor's theorem. One novelty is that while [12, 26] only ever dealt
with FT-mollifying indicator functions of regions in Euclidean space, here we must FT-mollify
functions of the form $f(x) = |x|^{1/t}$. To achieve our results, we express $E[f(x)] = \int_0^\infty f(x)\varphi_p(x)\,dx$
as $\int_0^\infty f'(x)(1 - \Phi_p(x))\,dx$ via integration by parts, where $\varphi_p$ is the density function of the absolute
value of the $p$-stable distribution, and $\Phi_p$ is the corresponding cumulative distribution function.
We then note $1 - \Phi_p(x) = \Pr[|X| \ge x] = E[I_{[x,\infty)\cup(-\infty,-x]}(X)]$ for $X$ $p$-stable, where $I_S$ is the
indicator function of the set $S$. We then FT-mollify $I_{[x,\infty)\cup(-\infty,-x]}$, which is the indicator function
of some set, to write $E[f(x)]$ as a weighted integral of indicator functions, from which point we can
apply the methods of [12, 26].

In order to estimate the contribution to $F_p$ from coordinates in $L$, we develop a novel data
structure we refer to as HighEnd. Suppose $L$ contains all the $\alpha$-heavy hitters, and every index in
$L$ is an $(\alpha/2)$-heavy hitter. We would like to compute $\|x_L\|_p^p \pm O(\varepsilon)\cdot\|x\|_p^p$, where $\alpha = \Omega(\varepsilon^2)$. We
maintain a matrix of counters $D_{j,k}$ for $(j, k) \in [t] \times [s]$ for $t = O(\log(1/\varepsilon))$ and $s = O(1/\alpha)$. For
each $j \in [t]$ we have a hash function $h^j : [n] \to [s]$ and $g^j : [n] \to [r]$ for $r = O(\log(1/\varepsilon))$. The
counter $D_{j,k}$ then stores $\sum_{h^j(v)=k} e^{2\pi i g^j(v)/r} x_v$ for $i = \sqrt{-1}$. That is, our data structure is similar
to the CountSketch data structure of Charikar, Chen, and Farach-Colton [8], but rather than taking
the dot product with a random sign vector in each counter, we take the dot product with a vector
whose entries are random complex roots of unity. At the end of the stream, our estimate of the
$F_p$-contribution from heavy hitters is

$$\mathrm{Re}\left[\sum_{w\in L}\left(\frac{3}{t}\sum_{k=1}^{t/3} e^{-2\pi i g^{j(w,k)}(w)/r}\cdot\mathrm{sign}(x_w)\cdot D_{j(w,k),\,h^{j(w,k)}(w)}\right)^{p}\right].$$

The choice to use complex roots of unity is to ensure that our estimator is approximately unbiased,
stemming from the fact that the real part of large powers of roots of unity is still 0 in expectation.
Here $\mathrm{Re}[z]$ denotes the real part of $z$, and $j(w, k)$ denotes the $k$th smallest value $b \in [t]$ such that
$h^b$ isolates $w$ from the other $w' \in L$ (if fewer than $t/3$ such $b$ exist, we fail). The subroutine Filter
for estimating the heavy hitter contribution for $p = 1$ in [30] did not use complex random variables,
but rather just used the dot product with a random sign vector as in CountSketch. Furthermore, it
required a $O(\log(1/\varepsilon))$ factor more space even for $p = 1$, since it did not average estimates across
$\Omega(t)$ levels to reduce variance.

For related problems, e.g., estimating $F_p$ for $p > 2$, using complex roots of unity leads to
sub-optimal bounds [15]. Moreover, it seems that "similar" algorithms using sign variables in place
of roots of unity do not work, as they have a constant factor bias in their expectation for which
it is unclear how to remove. Our initial intuition was that an algorithm using $p$-stable random
variables would be necessary to estimate the contribution to $F_p$ from the heavy hitters. However,
such approaches we explored suffered from too large a variance.

In parallel we must run an algorithm we develop to find the heavy hitters. Unfortunately, this
algorithm, as well as HighEnd, use suboptimal space. To overcome this, we actually use a list of
$\epsilon^2$-heavy hitters for $\epsilon = \varepsilon\cdot\log(1/\varepsilon)$. This then improves the space, at the expense of
increasing the variance of LightEstimator. We then run $O((\epsilon/\varepsilon)^2)$ pairwise independent instantiations of
LightEstimator in parallel and take the average estimate, to bring the variance down. This increases
some part of the update time of LightEstimator by a $\log^2(1/\varepsilon)$ factor, but this term turns out to
anyway be dominated by the time to evaluate various hash functions. Though, even in the
extreme case of balancing with $\epsilon = 1$, our algorithm for finding the heavy hitters requires
$\Omega(\log(n)\log(mM))$ space, which is suboptimal. We remedy this by performing a dimensionality
reduction down to dimension $\mathrm{poly}(1/\varepsilon)$ via hashing and dot products with random sign vectors.
We then apply HighEnd to estimate the contribution from heavy hitters in this new vector, and we
show that with high probability the correctness of our overall algorithm is still maintained.

1.3 Notation 

For a positive integer $r$, we use $[r]$ to denote the set $\{1, \ldots, r\}$. All logarithms are base-2 unless
otherwise noted. For a complex number $z$, $\mathrm{Re}[z]$ is the real part of $z$, $\mathrm{Im}[z]$ is the imaginary part
of $z$, $\bar{z}$ is the complex conjugate of $z$, and $|z| = \sqrt{\bar{z}\cdot z}$. At times we consider random variables $X$
taking on complex values. For such random variables, we use $\mathrm{Var}[X]$ to denote $E[|X - E[X]|^2]$.
Note that the usual statement of Chebyshev's inequality still holds under this definition.

For $x \in \mathbb{R}^n$ and $S \subseteq [n]$, $x_S$ denotes the $n$-dimensional vector whose $i$th coordinate is $x_i$ for $i \in S$
and 0 otherwise. For a probabilistic event $\mathcal{E}$, we use $\mathbf{1}_{\mathcal{E}}$ to denote the indicator random variable
for $\mathcal{E}$. We sometimes refer to a constant as universal if it does not depend on other parameters,
such as $n$, $m$, $\varepsilon$, etc. All space bounds are measured in bits. When measuring time complexity,
we assume a word RAM with machine word size $\Omega(\log(nmM))$ so that standard arithmetic and
bitwise operations can be performed on words in constant time. We use reporting time to refer to
the time taken for a streaming algorithm to answer some query (e.g., "output an estimate of $F_p$").

Also, we can assume $n = O(m^2)$ by FKS hashing [14] with an additive $O(\log\log n)$ term in
our final space bound; see Section A.1.1 of [26] for details. Thus, henceforth any terms involving
$n$ appearing in space and time bounds may be assumed at most $m^2$. We also often assume that
$n$, $m$, $M$, $\varepsilon$, and $\delta$ are powers of 2 (or sometimes 4), and that $1/\sqrt{n} < \varepsilon < \varepsilon_0$ for some universal
constant $\varepsilon_0 > 0$. These assumptions are without loss of generality. We can assume $\varepsilon > 1/\sqrt{n}$ since
otherwise we could store $x$ explicitly in memory using $O(n\log(mM)) = O(\varepsilon^{-2}\log(mM))$ bits with
constant update and reporting times. Finally, we assume $\|x\|_p^p \ge 1$. This is because, since $x$ has
integer entries, either $\|x\|_p^p \ge 1$, or it is 0. The case that it is 0 only occurs when $x$ is the 0 vector,
which can be detected in $O(\log(nmM))$ space by the AMS sketch [3].

1.4 Organization 

An "$F_p$ $\phi$-heavy hitter" is an index $j$ such that $|x_j|^p \ge \phi\|x\|_p^p$. Sometimes we drop the "$F_p$" if $p$
is understood from context. In Section 2, we give an efficient subroutine HighEnd for estimating
$\|x_L\|_p^p$ to within additive error $\varepsilon\|x\|_p^p$, where $L$ is a list containing all $\alpha$-heavy hitters for some
$\alpha > 0$, with the promise that every $i \in L$ is an $\alpha/2$-heavy hitter. In Section 3, we give a subroutine
LightEstimator for estimating $\|x_{[n]\setminus L}\|_p^p$. Finally, in Section 4, we put everything together in a way
that achieves optimal space and fast update time. We discuss how to compute $L$ in Section A.1.

2 Estimating the contribution from heavy hitters 

Before giving our algorithm HighEnd for estimating $\|x_L\|_p^p$, we first give a few necessary lemmas
and theorems.

The following theorem gives an algorithm for finding the $\phi$-heavy hitters with respect to $F_p$.
This algorithm uses the dyadic interval idea of [10] together with a black-box reduction of the
problem of finding $F_p$ heavy hitters to the problem of estimating $F_p$. Our proof is in Section A.1.
We note that our data structure both improves and generalizes that of [16], which gave an algorithm
with slightly worse bounds that only worked in the case $p = 1$.

Theorem 1. There is an algorithm $F_p$HH satisfying the following properties. Given $0 < \phi < 1$ and
$0 < \delta < 1$, with probability at least $1 - \delta$, $F_p$HH produces a list $L$ such that $L$ contains all $\phi$-heavy
hitters and does not contain indices which are not $\phi/2$-heavy hitters. For each $i \in L$, the
algorithm also outputs $\mathrm{sign}(x_i)$, as well as an estimate $\tilde{x}_i$ of $x_i$ satisfying $\tilde{x}_i^p \in [(6/7)|x_i|^p, (9/7)|x_i|^p]$.
Its space usage is $O(\phi^{-1}\log(\phi n)\log(nmM)\log(\log(\phi n)/(\delta\phi)))$. Its update time is
$O(\log(\phi n)\cdot\log(\log(\phi n)/(\delta\phi)))$. Its reporting time is $O(\phi^{-1}(\log(\phi n)\cdot\log(\log(\phi n)/(\delta\phi))))$.

The following moment bound can be derived from the Chernoff bound via integration, and is 
most likely standard though we do not know the earliest reference. A proof can be found in 



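Lemma 2. Let $X_1, \ldots, X_n$ be such that $X_i$ has expectation $\mu_i$ and variance $\sigma_i^2$, and $X_i \le K$ almost
surely. Then if the $X_i$ are $\ell$-wise independent for some even integer $\ell \ge 2$,

$$E\left[\left(\sum_{i=1}^{n} X_i - \mu\right)^{\ell}\right] \le 2^{O(\ell)}\cdot\left(\left(\sigma\sqrt{\ell}\right)^{\ell} + (K\ell)^{\ell}\right),$$

where $\mu = \sum_i \mu_i$ and $\sigma^2 = \sum_i \sigma_i^2$. In particular,

$$\Pr\left[\left|\sum_{i=1}^{n} X_i - \mu\right| \ge \lambda\right] \le 2^{O(\ell)}\cdot\left(\left(\sigma\sqrt{\ell}/\lambda\right)^{\ell} + (K\ell/\lambda)^{\ell}\right),$$

by Markov's inequality on the random variable $(\sum_i X_i - \mu)^{\ell}$.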

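Lemma 3 (Khintchine inequality [19]). For $x \in \mathbb{R}^n$, $t \ge 2$, and uniformly random $z \in \{-1, 1\}^n$,

$$E_z[|\langle x, z\rangle|^t] \le \|x\|_2^t\cdot\sqrt{t}^{\,t}.$$

In the following lemma, and henceforth in this section, $i$ denotes $\sqrt{-1}$.

Lemma 4. Let $x \in \mathbb{R}^n$ be arbitrary. Let $z \in \{e^{2\pi i/r}, e^{2\pi i\cdot 2/r}, e^{2\pi i\cdot 3/r}, \ldots, e^{2\pi i\cdot r/r}\}^n$ be a random
such vector for $r \ge 2$ an even integer. Then for $t \ge 2$ an even integer, $E_z[|\langle x, z\rangle|^t] \le \|x\|_2^t\cdot 2^{t/2}\sqrt{t}^{\,t}$.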

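Proof. Since $x$ is real, $|\langle x, z\rangle|^2 = \left(\sum_{j=1}^{n}\mathrm{Re}[z_j]\cdot x_j\right)^2 + \left(\sum_{j=1}^{n}\mathrm{Im}[z_j]\cdot x_j\right)^2$. Then by Minkowski's
inequality,

$$E\left[|\langle x, z\rangle|^t\right] \le 2^{t/2}\cdot\left(E\left[\left(\sum_{j=1}^{n}\mathrm{Re}[z_j]\cdot x_j\right)^{t}\right] + E\left[\left(\sum_{j=1}^{n}\mathrm{Im}[z_j]\cdot x_j\right)^{t}\right]\right). \qquad (1)$$

Since $r$ is even, we may write $\mathrm{Re}[z_j]$ as $(-1)^{y_j}|\mathrm{Re}[z_j]|$ and $\mathrm{Im}[z_j]$ as $(-1)^{y'_j}|\mathrm{Im}[z_j]|$, where $y, y' \in
\{-1, 1\}^n$ are random sign vectors chosen independently of each other. Let us fix the values of
$|\mathrm{Re}[z_j]|$ and $|\mathrm{Im}[z_j]|$ for each $j \in [n]$, considering just the randomness of $y$ and $y'$. Applying
Lemma 3 to bound each of the expectations in Eq. (1), we obtain the bound $2^{t/2}\cdot\sqrt{t}^{\,t}\cdot(\|b\|_2^t + \|b'\|_2^t) \le
2^{t/2}\cdot\sqrt{t}^{\,t}\cdot(\|b\|_2^2 + \|b'\|_2^2)^{t/2}$ where $b_j = \mathrm{Re}[z_j]\cdot x_j$ and $b'_j = \mathrm{Im}[z_j]\cdot x_j$. But this is just
$2^{t/2}\cdot\sqrt{t}^{\,t}\cdot\|x\|_2^t$ since $|\mathrm{Re}[z_j]|^2 + |\mathrm{Im}[z_j]|^2 = 1$. ■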

2.1 The HighEnd data structure 

In this section, we assume we know a subset $L \subseteq [n]$ of indices $j$ so that

1. for all $j$ for which $|x_j|^p \ge \alpha\|x\|_p^p$, $j \in L$,

2. if $j \in L$, then $|x_j|^p \ge (\alpha/2)\|x\|_p^p$,

3. for each $j \in L$, we know $\mathrm{sign}(x_j)$.

for some $0 < \alpha < 1/2$ which we know. We also are given some $0 < \varepsilon < 1/2$. We would like to
output a value $\|x_L\|_p^p \pm O(\varepsilon)\|x\|_p^p$ with large constant probability. We assume $1/\alpha = O(1/\varepsilon^2)$.

We first define the BasicHighEnd data structure. Put $s = \lceil 4/\alpha\rceil$. We choose a hash function
$h : [n] \to [s]$ at random from an $r_h$-wise independent family for $r_h = \Theta(\log(1/\alpha))$. Also, let
$r = \Theta(\log 1/\varepsilon)$ be a sufficiently large even integer. For each $j \in [n]$, we associate a random complex
root of unity $e^{2\pi i g(j)/r}$, where $g : [n] \to [r]$ is drawn at random from an $r_g$-wise independent family
for $r_g = r$. We initialize $s$ counters $b_1, \ldots, b_s$ to 0. Given an update of the form $(j, v)$, add $e^{2\pi i g(j)/r}\cdot v$
to $b_{h(j)}$.

We now define the HighEnd data structure as follows. Define $T = \tau\cdot\max\{\log(1/\varepsilon), \log(2/\alpha)\}$ for
a sufficiently large constant $\tau$ to be determined later. Define $t = 3T$ and instantiate $t$ independent
copies of the BasicHighEnd data structure. Given an update $(j, v)$, perform the update described
above to each of the copies of BasicHighEnd. We think of this data structure as a $t \times s$ matrix
of counters $D_{j,k}$, $j \in [t]$ and $k \in [s]$. We let $g^j$ be the hash function $g$ in the $j$th independent
instantiation of BasicHighEnd, and similarly define $h^j$. We sometimes use $g$ to denote the tuple
$(g^1, \ldots, g^t)$, and similarly for $h$.

We now define our estimator, but first we give some notation. For $w \in L$, let $j(w, 1) < j(w, 2) <
\ldots < j(w, n_w)$ be the set of $n_w$ indices $j \in [t]$ such that $w$ is isolated by $h^j$ from other indices in $L$;
that is, indices $j \in [t]$ where no other $w' \in L$ collides with $w$ under $h^j$.

Event $\mathcal{E}$. Define $\mathcal{E}$ to be the event that $n_w \ge T$ for all $w \in L$.

If $\mathcal{E}$ does not hold, our estimator simply fails. Otherwise, define

$$x_w^* = \frac{1}{T}\cdot\sum_{k=1}^{T} e^{-2\pi i g^{j(w,k)}(w)/r}\cdot\mathrm{sign}(x_w)\cdot D_{j(w,k),\,h^{j(w,k)}(w)}.$$

If $\mathrm{Re}[x_w^*] < 0$ for any $w \in L$, then we output fail. Otherwise, define

$$\Psi' = \sum_{w\in L}\left(x_w^*\right)^p.$$

Our estimator is then $\Psi = \mathrm{Re}[\Psi']$. Note $x_w^*$ is a complex number. By $z^p$ for complex $z$, we mean
$|z|^p\cdot e^{ip\cdot\arg(z)}$, where $\arg(z) \in (-\pi, \pi]$ is the angle formed by the vector from the origin to $z$ in the
complex plane.
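
The HighEnd update and estimation procedure defined above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the hash functions $h^j$ and $g^j$ are replaced by fully random lookup tables (the paper uses limited-independence families to save space), indices are 0-based, and the class name, method names, and example values are hypothetical.

```python
import cmath
import math
import random


class HighEndSketch:
    """Illustrative t x s counter matrix D with random roots of unity."""

    def __init__(self, n, s, r, T, seed=0):
        rng = random.Random(seed)
        self.s, self.r, self.T, self.t = s, r, T, 3 * T
        # Fully random tables stand in for the limited-independence families.
        self.h = [[rng.randrange(s) for _ in range(n)] for _ in range(self.t)]
        self.g = [[rng.randrange(r) for _ in range(n)] for _ in range(self.t)]
        self.D = [[0j] * s for _ in range(self.t)]

    def _root(self, j, v):
        """The r-th root of unity attached to coordinate v in row j."""
        return cmath.exp(2j * math.pi * self.g[j][v] / self.r)

    def update(self, v, delta):
        """Process x_v <- x_v + delta in every row."""
        for j in range(self.t):
            self.D[j][self.h[j][v]] += self._root(j, v) * delta

    def estimate(self, L, signs, p):
        """Return Re[sum_w (x*_w)^p], or None if the estimator fails."""
        total = 0j
        for w in L:
            # Rows in which no other member of L collides with w.
            rows = [j for j in range(self.t)
                    if all(self.h[j][u] != self.h[j][w] for u in L if u != w)]
            if len(rows) < self.T:
                return None  # event E fails: w is isolated in too few rows
            x_star = sum(self._root(j, w).conjugate() * signs[w]
                         * self.D[j][self.h[j][w]] for j in rows[:self.T]) / self.T
            if x_star.real < 0:
                return None
            # z^p is defined as |z|^p * exp(i * p * arg(z)).
            total += abs(x_star) ** p * cmath.exp(1j * p * cmath.phase(x_star))
        return total.real


x = {3: 40, 17: -25, 42: 30}            # heavy coordinates (the list L)
light = {5: 1, 8: -2, 23: 1, 31: 2}     # light coordinates
sketch = HighEndSketch(n=64, s=64, r=8, T=4, seed=7)
for v, value in {**x, **light}.items():
    sketch.update(v, value)
signs = {w: (1 if value > 0 else -1) for w, value in x.items()}
print(sketch.estimate(list(x), signs, p=1.5))
```

When no light coordinates are present, each isolated counter holds exactly $e^{2\pi i g^j(w)/r} x_w$, so multiplying by the conjugate root and the sign recovers $|x_w|$ and the estimate equals $\|x_L\|_p^p$ up to floating-point error. Light coordinates that collide with $w$ perturb $x_w^*$ by terms with random phases, which is the source of the error analyzed in the rest of this section.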




2.2 A useful random variable 

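For $w \in L$, we make the definitions

$$y_w \stackrel{\mathrm{def}}{=} \frac{x_w^* - |x_w|}{|x_w|}, \qquad \Phi_w \stackrel{\mathrm{def}}{=} |x_w|^p\cdot\left(\sum_{k=0}^{r/3}\binom{p}{k}\,y_w^k\right)$$

as well as $\Phi = \sum_{w\in L}\Phi_w$. We assume $\mathcal{E}$ occurs so that the $y_w$ and $\Phi_w$ (and hence $\Phi$) are defined.
Also, we use the definition $\binom{p}{k} = \left(\prod_{j=0}^{k-1}(p - j)\right)/k!$ (note $p$ may not be an integer).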

Our overall goal is to show that $\Psi = \|x_L\|_p^p \pm O(\varepsilon)\cdot\|x\|_p^p$ with large constant probability. Our
proof plan is to first show that $|\Phi - \|x_L\|_p^p| = O(\varepsilon)\cdot\|x\|_p^p$ with large constant probability, then to
show that $|\Psi' - \Phi| = O(\varepsilon)\cdot\|x\|_p^p$ with large constant probability, at which point our claim follows
by a union bound and the triangle inequality since $|\Psi - \|x_L\|_p^p| \le |\Psi' - \|x_L\|_p^p|$ since $\|x_L\|_p^p$ is real.
Before analyzing $\Phi$, we define the following event.



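Event $\mathcal{D}$. Let $\mathcal{D}$ be the event that for all $w \in L$ we have

$$\frac{1}{T^2}\sum_{k=1}^{T}\ \sum_{\substack{v\notin L\\ h^{j(w,k)}(v)=h^{j(w,k)}(w)}} x_v^2 \le \frac{(\alpha\cdot\|x\|_p^p)^{2/p}}{r}.$$

We also define

$$V = \frac{1}{T^2}\sum_{w\in L}\ \sum_{j=1}^{t}\ \sum_{\substack{v\notin L\\ h^{j}(w)=h^{j}(v)}} |x_w|^{2p-2}\cdot|x_v|^2.$$

Theorem 5. Conditioned on $h$, $E_g[\Phi] = \|x_L\|_p^p$ and $\mathrm{Var}_g[\Phi \mid \mathcal{D}] = O(V)$.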
Proof. By linearity of expectation,

$$E_g[\Phi] = \sum_{w\in L}|x_w|^p\cdot\left[\sum_{k=0}^{r/3}\binom{p}{k}E_g[y_w^k]\right] = \sum_{w\in L}|x_w|^p + \sum_{w\in L}|x_w|^p\cdot\sum_{k=1}^{r/3}\binom{p}{k}E_g[y_w^k],$$

where we use that $\binom{p}{0} = 1$. Then $E_g[y_w^k] = 0$ for $k > 0$ by using linearity of expectation and $r_g$-wise
independence, since each summand involves at most $k < r$ $r$th roots of unity. Hence,

$$E_g[\Phi] = \sum_{w\in L}|x_w|^p = \|x_L\|_p^p.$$

We now compute the variance. Note that if the $g^j$ were each fully independent, then we would have
$\mathrm{Var}_g[\Phi \mid \mathcal{D}] = \sum_{w\in L}\mathrm{Var}_g[\Phi_w \mid \mathcal{D}]$ since different $\Phi_w$ depend on evaluations of the $g^j$ on disjoint
$v \in [n]$. However, since $r_g > 2r/3$, $E_g[|\Phi|^2]$ is identical as in the case of full independence of the $g^j$.



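We thus have $\mathrm{Var}_g[\Phi \mid \mathcal{D}] = \sum_{w\in L}\mathrm{Var}_g[\Phi_w \mid \mathcal{D}]$ and have reduced to computing $\mathrm{Var}_g[\Phi_w \mid \mathcal{D}]$.

$$\begin{aligned}
\mathrm{Var}_g[\Phi_w \mid \mathcal{D}] &= E_g\left[\,|\Phi_w - E_g[\Phi_w]|^2 \mid \mathcal{D}\right] \\
&= |x_w|^{2p}\cdot E_g\left[\left|\sum_{k=1}^{r/3}\binom{p}{k}\,y_w^k\right|^2 \,\middle|\, \mathcal{D}\right] \\
&= |x_w|^{2p}\cdot\left(p^2\cdot E_g[|y_w|^2 \mid \mathcal{D}] + \sum_{k=2}^{r/3} O\left(E_g[|y_w|^{2k} \mid \mathcal{D}]\right)\right).
\end{aligned}$$

We have

$$E_g[|y_w|^2 \mid \mathcal{D}] \stackrel{\mathrm{def}}{=} u_w^2 = \frac{1}{T^2}\sum_{k=1}^{T}\ \sum_{\substack{v\notin L\\ h^{j(w,k)}(v)=h^{j(w,k)}(w)}} \frac{x_v^2}{x_w^2} \qquad (2)$$

so that

$$\sum_{w\in L}|x_w|^{2p}\cdot p^2\cdot E_g[|y_w|^2 \mid \mathcal{D}] \le p^2 V.$$

Eq. (2) follows since, conditioned on $\mathcal{E}$ so that $y_w$ is well-defined,

$$E_g[|y_w|^2] = \frac{1}{T^2 x_w^2}\sum_{k=1}^{T}\sum_{k'=1}^{T}\ \sum_{\substack{v\notin L\\ h^{j(w,k)}(v)=h^{j(w,k)}(w)}}\ \sum_{\substack{v'\notin L\\ h^{j(w,k')}(v')=h^{j(w,k')}(w)}} E\left[e^{-2\pi i\left(g^{j(w,k)}(v) - g^{j(w,k')}(v')\right)/r}\right] x_v x_{v'}.$$

When $j(w, k) \ne j(w, k')$ the above expectation is 0 since the $g^j$ are independent across different $j$.
When $j(w, k) = j(w, k')$ the above expectation is only non-zero for $v = v'$ since $r_g \ge 2$.
We also have for $k \ge 2$ that

$$E_g[|y_w|^{2k} \mid \mathcal{D}] \le 2^{O(k)}\cdot u_w^{2k}\cdot(2k)^k$$

by Lemma 4, so that

$$\sum_{k=2}^{r/3} E_g[|y_w|^{2k} \mid \mathcal{D}] = O(u_w^2)$$

since $\mathcal{D}$ holds and so the sum is dominated by its first term. Thus, $\mathrm{Var}_g[\Phi \mid \mathcal{D}] = O(V)$. ■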

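Lemma 6. $E_h[V] \le 3\alpha\cdot\|x\|_p^{2p}/(4T)$.

Proof. For any $w \in L$, $v \notin L$, and $j \in [t]$, we have $\Pr_h[h^j(w) = h^j(v)] = 1/s \le \alpha/4$ since $r_h \ge 2$.
Thus

$$\begin{aligned}
E_h[V] &\le \frac{\alpha}{4T^2}\sum_{w\in L}\ \sum_{j\in[t]}\ \sum_{v\notin L}|x_w|^{2p-2}|x_v|^2 \\
&= \frac{3\alpha}{4T}\left(\sum_{w\in L}|x_w|^p|x_w|^{p-2}\right)\left(\sum_{v\notin L}|x_v|^2\right) \\
&\le \frac{3\alpha}{4T}\cdot\|x\|_p^p\,(\alpha\cdot\|x\|_p^p)^{(p-2)/p}\cdot\frac{1}{\alpha}(\alpha\cdot\|x\|_p^p)^{2/p} \qquad (3)\\
&= \frac{3}{4}\cdot\alpha\cdot\|x\|_p^{2p}/T,
\end{aligned}$$

where Eq. (3) used that $\|x_{[n]\setminus L}\|_2^2$ is maximized when $[n]\setminus L$ contains exactly $1/\alpha$ coordinates $v$ each
with $|x_v|^p = \alpha\|x\|_p^p$, and that $|x_w|^{p-2} \le (\alpha\cdot\|x\|_p^p)^{(p-2)/p}$ since $p \le 2$. ■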

Lemma 7. $\Pr_h[\mathcal{E}] \ge 1 - \varepsilon$.

Proof. For any $j \in [t]$, the probability that $w$ is isolated by $h^j$ is at least 1/2, since the expected
number of collisions with $w$ is at most 1/2 by pairwise independence of the $h^j$ and the fact that
$|L| \le 2/\alpha$ so that $s \ge 2|L|$. If $X$ is the number of buckets where $w$ is isolated, the
Chernoff bound gives $\Pr_h[X < (1 - \epsilon)E_h[X]] < \exp(-\epsilon^2 E_h[X]/2)$ for $0 < \epsilon < 1$. The claim follows
for $\tau \ge 24$ by setting $\epsilon = 1/3$ then applying a union bound over $w \in L$. ■
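
The arithmetic behind the last sentence of the proof can be spelled out as follows. This is a sketch of the omitted step, assuming (as the isolation bound gives) that $E_h[X] \ge t/2 = 3T/2$, that the $t$ rows are independent, and that logarithms are base 2 while $\exp$ is base $e$.

```latex
% Step 1: T = (2/3)(3T/2) <= (1 - 1/3) E_h[X], so the Chernoff bound with parameter 1/3 applies.
\Pr_h[X < T] \;\le\; \Pr_h\!\left[X < \left(1 - \tfrac{1}{3}\right)E_h[X]\right]
             \;<\; \exp\!\left(-\tfrac{1}{18}\,E_h[X]\right)
             \;\le\; \exp(-T/12).

% Step 2: T = \tau \max\{\log(1/\varepsilon), \log(2/\alpha)\} with \tau \ge 24 gives
% T/12 \ge \log(1/\varepsilon) + \log(2/\alpha), and \exp(-a) \le 2^{-a} for a \ge 0.
\exp(-T/12) \;\le\; 2^{-\log(1/\varepsilon) - \log(2/\alpha)} \;=\; \varepsilon \cdot \frac{\alpha}{2}.

% Step 3: union bound over the at most 2/\alpha indices w in L.
\Pr_h[\neg\mathcal{E}] \;\le\; |L| \cdot \varepsilon \cdot \frac{\alpha}{2} \;\le\; \varepsilon.
```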

Lemma 8. $\Pr_h[\mathcal{D}] \ge 63/64$.

Proof. We apply the bound of Lemma 2 for a single $w \in L$. Define $X_{j,v} = (x_v^2/T^2)\cdot\mathbf{1}_{h^j(v)=h^j(w)}$
and $X = \sum_{j=1}^{t}\sum_{v\notin L}X_{j,v}$. Note that $X$ is an upper bound for the left hand side of the inequality
defining $\mathcal{D}$, and thus it suffices to show a tail bound for $X$. In the notation of Lemma 2, we have
$\sigma^2 \le (3/(sT^3))\cdot\|x_{[n]\setminus L}\|_4^4$, $K = (\alpha\cdot\|x\|_p^p)^{2/p}/T^2$, and $\mu = (3/(sT))\cdot\|x_{[n]\setminus L}\|_2^2$. Since $\|x_{[n]\setminus L}\|_2^2$ and
$\|x_{[n]\setminus L}\|_4^4$ are each maximized when there are exactly $1/\alpha$ coordinates $v \notin L$ with $|x_v|^p = \alpha\cdot\|x\|_p^p$,

$$\sigma^2 \le \frac{3}{4T^3}\cdot(\alpha\cdot\|x\|_p^p)^{4/p}, \qquad \mu \le \frac{3}{4T}\cdot(\alpha\cdot\|x\|_p^p)^{2/p}.$$

Setting $A = (\alpha\cdot\|x\|_p^p)^{2/p}/(2r)$, noting that $\mu < A$ for $\tau$ sufficiently large, and assuming $\ell \le r_h$
is even, we apply Lemma 2 to obtain

$$\Pr[X \ge 2A] \le 2^{O(\ell)}\cdot\left(\left(\frac{\sqrt{3}\,r\cdot\sqrt{\ell}}{T^{3/2}}\right)^{\ell} + \left(\frac{2r\cdot\ell}{T^2}\right)^{\ell}\right).$$

By setting $\tau$ sufficiently large and $\ell = \log(2/\alpha) + 6$, the above probability is at most $(1/64)\cdot(\alpha/2)$.
The lemma follows by a union bound over all $w \in L$, since $|L| \le 2/\alpha$. ■

We now define another event.

Event $\mathcal{F}$. Let $\mathcal{F}$ be the event that for all $w \in L$ we have $|y_w| < 1/2$.

Lemma 9. $\Pr_g[\mathcal{F} \mid \mathcal{D}] \ge 63/64$.

Proof. $\mathcal{D}$ occurring implies that $u_w \le \sqrt{1/r} \le \sqrt{1/(64(\log(2/\alpha) + 6))}$ (recall we assume $1/\alpha =
O(1/\varepsilon^2)$ and pick $r = \Theta(\log(1/\varepsilon))$ sufficiently large, and $u_w$ is as is defined in Eq. (2)), and we also
have $E_g[|y_w|^{\ell} \mid \mathcal{D}] < u_w^{\ell}\cdot\sqrt{\ell}^{\,\ell}\cdot 2^{\ell}$ by Lemma 4. Applying Markov's bound on the random variable $|y_w|^{\ell}$
for even $\ell \le r_g$, we have $|y_w|^{\ell}$ is determined by $r_g$-wise independence of the $g^j$, and thus

$$\Pr_g[|y_w| \ge 1/2 \mid \mathcal{D}] < \left(\sqrt{\frac{16\,\ell}{64(\log(2/\alpha) + 6)}}\right)^{\ell},$$

which equals $(1/64)\cdot(\alpha/2)$ for $\ell = \log(2/\alpha) + 6$. We then apply a union bound over all $w \in L$. ■



/' 



Lemma 10. Given $\mathcal{F}$, $|\Psi' - \Phi| < \varepsilon\|x_L\|_p^p$.

Proof. Observe

$$\Psi' = \sum_{w\in L}|x_w|^p\cdot(1 + y_w)^p.$$

We have that $\ln(1 + z)$, as a function of $z$, is holomorphic on the open disk of radius 1 about 0 in
the complex plane, and thus $f(z) = (1 + z)^p$ is holomorphic in this region since it is the composition
$\exp(p\cdot\ln(1 + z))$ of holomorphic functions. Therefore, $f(z)$ equals its Taylor expansion about 0
for all $z \in \mathbb{C}$ with $|z| < 1$ (see for example [39, Theorem 11.2]). Then since $\mathcal{F}$ occurs, we can
Taylor-expand $f$ about 0 for $z = y_w$ and apply Taylor's theorem to obtain

$$\begin{aligned}
\Psi' &= \sum_{w\in L}|x_w|^p\left(\sum_{k=0}^{r/3}\binom{p}{k}\,y_w^k \pm O\left(\binom{p}{r/3+1}\cdot|y_w|^{r/3+1}\right)\right) \\
&= \Phi + O\left(\|x_L\|_p^p\cdot\binom{p}{r/3+1}\cdot\max_{w\in L}|y_w|^{r/3+1}\right).
\end{aligned}$$

The lemma follows since $\left|\binom{p}{r/3+1}\right| < 1$ and $|y_w|^{r/3+1} < \varepsilon$ for $|y_w| < 1/2$. ■
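
The truncation used in Lemma 10 can be checked numerically. The Python sketch below evaluates the generalized binomial coefficient as defined in Section 2.2 and compares the truncated series with $(1 + y)^p$ for a complex $y$ with $|y| < 1/2$. The function names and the sample values of $p$ and $y$ are illustrative.

```python
def generalized_binomial(p, k):
    """Return (p choose k) = prod_{j=0}^{k-1} (p - j) / k! for real p, integer k >= 0."""
    result = 1.0
    for j in range(k):
        result *= (p - j) / (j + 1)
    return result


def truncated_power(y, p, degree):
    """Evaluate sum_{k=0}^{degree} (p choose k) * y^k, a truncation of (1 + y)^p."""
    return sum(generalized_binomial(p, k) * y ** k for k in range(degree + 1))


p, y = 1.3, 0.3 + 0.2j          # |y| < 1/2, as guaranteed by the event F
for degree in (2, 5, 10, 20):
    error = abs((1 + y) ** p - truncated_power(y, p, degree))
    print(degree, error)
```

The error decays geometrically in the degree, which is why truncating at degree $r/3 = \Theta(\log(1/\varepsilon))$ is enough for error $\varepsilon$. Python's complex power uses the principal branch, matching the paper's convention $z^p = |z|^p e^{ip\cdot\arg(z)}$.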

Theorem 11. The space used by HighEnd is $O(\alpha^{-1}\log(1/\varepsilon)\log(mM/\varepsilon) + \log^2(1/\varepsilon)\log n)$. The
update time is $O(\log^2(1/\varepsilon))$. The reporting time is $O(\alpha^{-1}\log(1/\varepsilon)\log(1/\alpha))$. Also,
$\Pr_{h,g}[|\Psi - \|x_L\|_p^p| < O(\varepsilon)\cdot\|x\|_p^p] > 7/8$.

Proof. We first argue correctness. By a union bound, $\mathcal{E}$ and $\mathcal{D}$ hold simultaneously with probability
31/32. By Markov's inequality and Lemma 6, $V = O(\alpha\cdot\|x\|_p^{2p}/T)$ with probability 63/64. We then
have by Chebyshev's inequality and Theorem 5 that $|\Phi - \|x_L\|_p^p| = O(\varepsilon)\cdot\|x\|_p^p$ with probability
15/16. Lemma 10 then implies $|\Psi' - \|x_L\|_p^p| = O(\varepsilon)\cdot\|x\|_p^p$ with probability $15/16 - \Pr[\neg\mathcal{F}] > 7/8$
by Lemma 9. In this case, the same must hold true for $\Psi$ since $\Psi = \mathrm{Re}[\Psi']$ and $\|x_L\|_p^p$ is real.

Next we discuss space complexity. We start with analyzing the precision required to store the
counters $D_{j,k}$. Since our correctness analysis conditions on $\mathcal{F}$, we can assume $\mathcal{F}$ holds. We store
the real and imaginary parts of each counter $D_{j,k}$ separately. If we store each such part to within
precision $\gamma/(2mT)$ for some $0 < \gamma < 1$ to be determined later, then each of the real and imaginary
parts, which are the sums of at most $m$ summands from the $m$ updates in the stream, is stored to
within additive error $\gamma/(2T)$ at the end of the stream. Let $\tilde{x}_w^*$ be our calculation of $x_w^*$ with such
limited precision. Then, each of the real and imaginary parts of $\tilde{x}_w^*$ is within additive error $\gamma/2$ of
those for $x_w^*$. Since $\mathcal{F}$ occurs, $|x_w^*| > 1/2$, and thus $\gamma/2 < \gamma|x_w^*|$, implying $|\tilde{x}_w^*| = (1 \pm O(\gamma))|x_w^*|$.
Now we argue $\arg(\tilde{x}_w^*) = \arg(x_w^*) \pm O(\sqrt{\gamma})$. Write $x_w^* = a + ib$ and $\tilde{x}_w^* = \tilde{a} + i\tilde{b}$ with $\tilde{a} = a \pm \gamma/2$
and $\tilde{b} = b \pm \gamma/2$. We have $\cos(\arg(x_w^*)) = a/\sqrt{a^2 + b^2}$. Also, $\cos(\arg(\tilde{x}_w^*)) = (a \pm \gamma/2)/((1 \pm
O(\gamma))\sqrt{a^2 + b^2}) = (1 \pm O(\gamma))\cos(\arg(x_w^*)) \pm O(\gamma) = \cos(\arg(x_w^*)) \pm O(\gamma)$, implying $\arg(\tilde{x}_w^*) =
\arg(x_w^*) \pm O(\sqrt{\gamma})$. Our final output is $\sum_{w\in L}|\tilde{x}_w^*|^p\cos(p\cdot\arg(\tilde{x}_w^*))$. Since cos never has derivative
larger than 1 in magnitude, this is $\sum_{w\in L}[(1 \pm O(\gamma))|x_w^*|^p\cos(p\cdot\arg(x_w^*)) \pm O(\sqrt{\gamma})\cdot(1 \pm O(\gamma))|x_w^*|^p]$.
Since $\mathcal{F}$ occurs, $|x_w^*|^p < (3/2)^p\cdot|x_w|^p$, and thus our overall error introduced from limited precision is
$O(\sqrt{\gamma}\cdot\|x_L\|_p^p)$, and it thus suffices to set $\gamma = O(\varepsilon^2)$, implying each $D_{j,k}$ requires $O(\log(mM/\varepsilon))$ bits
of precision. For the remaining part of the space analysis, we discuss storing the hash functions.
The hash functions $h^j, g^j$ each require $O(\log(1/\varepsilon)\log n)$ bits of seed, and thus in total consume
$O(\log^2(1/\varepsilon)\log n)$ bits.

Finally we discuss time complexity. To perform an update, for each $j \in [t]$ we must evaluate
$g^j$ and $h^j$ then update a counter. Each of $g^j, h^j$ requires $O(\log(1/\varepsilon))$ time to evaluate. For the
reporting time, we can mark all counters with the unique $w \in L$ which hashes to it under the
corresponding $h^j$ (if a unique such $w$ exists) in $|L| \cdot t \cdot r_h = O(\alpha^{-1} \log(1/\varepsilon) \log(1/\alpha))$ time. Then,
we sum up the appropriate counters for each $w \in L$, using the Taylor expansion of $\cos(p \cdot \arg(z))$
up to the $O(\log(1/\varepsilon))$th degree to achieve additive error $\varepsilon$. Note that conditioned on $\mathcal{F}$, $\arg(x^*_w) \in
(-\pi/4, \pi/4)$, so that $|p \cdot \arg(x^*_w)|$ is bounded away from $\pi/2$ for $p$ bounded away from 2; in fact,
one can even show via some calculus that $\arg(x^*_w) \in (-\pi/6, \pi/6)$ when $\mathcal{F}$ occurs by showing that
$\cos(\arg(x^*_w)) = \cos(\arg(1 - y_w))$ is minimized for $|y_w| \le 1/2$ when $y_w = 1/4 + i\sqrt{3}/4$. Regardless,
additive error $\varepsilon$ is relative error $O(\varepsilon)$, since if $|p \cdot \arg(z)|$ is bounded away from $\pi/2$, then
$|\cos(p \cdot \arg(z))| = \Omega(1)$. ■
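
The truncation step can be illustrated with a minimal Python sketch. It assumes only the standard remainder bound for the cosine series, $|\theta|^{d+2}/(d+2)!$ after the degree-$d$ term; the function names `cos_taylor` and `degree_for_error` are hypothetical and not taken from the paper.

```python
import math

def cos_taylor(theta, degree):
    """Truncated Taylor series of cos around 0, using terms up to the given degree."""
    total, term = 0.0, 1.0  # term holds (-1)^k * theta^(2k) / (2k)!
    k = 0
    while 2 * k <= degree:
        total += term
        k += 1
        term *= -theta * theta / ((2 * k - 1) * (2 * k))
    return total

def degree_for_error(eps, bound=math.pi / 2):
    """Smallest even degree d whose remainder bound bound^(d+2) / (d+2)! is at most eps."""
    d = 0
    while bound ** (d + 2) / math.factorial(d + 2) > eps:
        d += 2
    return d
```

Because the remainder decays factorially, the degree returned by `degree_for_error` grows no faster than logarithmically in $1/\varepsilon$, which is the behavior the reporting-time bound relies on.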

3 Estimating the contribution from light elements 

In this section, we show how to estimate the contribution to $F_p$ from coordinates of $x$ which are not
heavy hitters. More precisely, given a list $L \subseteq [n]$ such that $|L| < 2/\varepsilon^2$ and $|x_i|^p < \varepsilon^2 \|x\|_p^p$ for all
$i \notin L$, we describe a subroutine LightEstimator that outputs a value that is $\|x_{[n] \setminus L}\|_p^p \pm O(\varepsilon) \cdot \|x\|_p^p$
with probability at least 7/8. This estimator is essentially the same as that given for $p = 1$ in [30],
though in this work we show that (some variant of) the geometric mean estimator of [28] requires
only bounded independence, in order that we may obtain optimal space.

Our description follows. We first need the following theorem, which comes from a derandomized
variant of the geometric mean estimator. Our proof is in Section A.2.

Theorem 12. For any $0 < p < 2$, there is a randomized data structure $D_p$, and a deterministic
algorithm $\mathrm{Est}_p$ mapping the state space of the data structure to reals, such that

1. $\mathbf{E}[\mathrm{Est}_p(D_p(x))] = (1 \pm \varepsilon)\|x\|_p^p$

2. $\mathbf{E}[\mathrm{Est}_p(D_p(x))^2] < C_p \cdot \|x\|_p^{2p}$

for some constant $C_p > 0$ depending only on $p$, and where the expectation is taken over the randomness
used by $D_p$. Aside from storing a length-$O(\varepsilon^{-p} \log(nmM))$ random string, the space complexity
is $O(\log(nmM))$. The update time is the time to evaluate a $\Theta(1/\varepsilon^p)$-wise independent hash function
over a field of size $\mathrm{poly}(nmM)$, and the reporting time is $O(1)$.

We also need the following algorithm for fast multipoint evaluation of polynomials.

Theorem 13 ([38, Ch. 10]). Let $R$ be a ring, and let $q \in R[x]$ be a degree-$d$ polynomial. Then, given
distinct $x_1, \ldots, x_d \in R$, all the values $q(x_1), \ldots, q(x_d)$ can be computed using $O(d \log^2 d \log\log d)$
operations over $R$.

The guarantees of the final LightEstimator are then given in Theorem 15, which is a modified
form of an algorithm designed in [30] for the case $p = 1$. A description of the modifications of the
algorithm in [30] needed to work for $p \neq 1$ is given in Remark 16, which in part uses the following
uniform hash family of Pagh and Pagh [35].



Theorem 14 (Pagh and Pagh [35, Theorem 1.1]). Let $S \subseteq U = [u]$ be a set of $z > 1$ elements,
and let $V = [v]$, with $1 < v < u$. Suppose the machine word size is $\Omega(\log(u))$. For any constant
$c > 0$ there is a word RAM algorithm that, using time $\log(z) \log^{O(1)}(v)$ and $O(\log(z) + \log\log(u))$
bits of space, selects a family $\mathcal{H}$ of functions from $U$ to $V$ (independent of $S$) such that:

1. With probability $1 - O(1/z^c)$, $\mathcal{H}$ is $z$-wise independent when restricted to $S$.

2. Any $h \in \mathcal{H}$ can be represented by a RAM data structure using $O(z \log(v))$ bits of space, and
$h$ can be evaluated in constant time after an initialization step taking $O(z)$ time.

Theorem 15 ([30]). Suppose we are given $0 < \varepsilon < 1$, and given a list $L \subseteq [n]$ at the end of
the data stream such that $|L| < 2/\varepsilon^2$ and $|x_i|^p < \varepsilon^2 \|x\|_p^p$ for all $i \notin L$. Then, given access to a
randomized data structure satisfying properties (1) and (2) of Theorem 12, there is an algorithm
LightEstimator satisfying the following. The randomness used by LightEstimator can be broken up
into a certain random hash function $h$, and another random string $s$. LightEstimator outputs a value
$\Phi'$ satisfying $\mathbf{E}_{h,s}[\Phi'] = (1 \pm O(\varepsilon))\|x_{[n] \setminus L}\|_p^p$, and $\mathbf{E}_h[\mathrm{Var}_s[\Phi']] = O(\varepsilon^2 \|x\|_p^{2p})$. The space usage is
$O(\varepsilon^{-2} \log(nmM))$, the update time is $O(\log^2(1/\varepsilon) \log\log(1/\varepsilon))$, and the reporting time is $O(1/\varepsilon^2)$.

Remark 16. The claim of Theorem 15 is not stated in the same form in [30], and thus we
provide some explanation. The work of [30] only focused on the case $p = 1$. There, in Section
3.2, LightEstimator was defined$^2$ by creating $R = 4/\varepsilon^2$ independent instantiations of $D_1$, which we
label $D_1^1, \ldots, D_1^R$ ($R$ chosen so that $R \ge 2|L|$), and picking a hash function $h : [n] \to [R]$ from
a random hash family constructed as in Theorem 14 with $z = R$ and $c \ge 2$. Upon receiving an
update to $x_i$ in the stream, the update was fed to $D_1^{h(i)}$. The final estimate was defined as follows.
Let $I = [R] \setminus h(L)$. Then, the estimate was $\Phi' = (R/|I|) \cdot \sum_{j \in I} \mathrm{Est}_1(D_1^j)$. In place of a generic
$D_1$, the presentation in [30] used Li's geometric mean estimator [28], though the analysis (Lemmas
7 and 8 of [30]) only made use of the generic properties of $D_1$ and $\mathrm{Est}_1$ given in Theorem 12.
Let $s = (s_1, \ldots, s_R)$ be the tuple of random strings used by the $D_1^j$, where the entries of $s$ are
pairwise independent. The analysis then showed that (a) $\mathbf{E}_{h,s}[\Phi'] = (1 \pm O(\varepsilon))\|x_{[n] \setminus L}\|_1$, and (b)
$\mathbf{E}_h[\mathrm{Var}_s[\Phi']] = O(\varepsilon^2 \|x\|_1^2)$. For (a), the same analysis applies for $p \neq 1$ when using $\mathrm{Est}_p$ and $D_p$
instead. For (b), it was shown that $\mathbf{E}_h[\mathrm{Var}_s[\Phi']] = O(\|x_{[n] \setminus L}\|_2^2 + \varepsilon^2 \|x_{[n] \setminus L}\|_1^2)$. The same analysis
shows that $\mathbf{E}_h[\mathrm{Var}_s[\Phi']] = O(\|x_{[n] \setminus L}\|_{2p}^{2p} + \varepsilon^2 \|x_{[n] \setminus L}\|_p^{2p})$ for $p \neq 1$. Since $L$ contains all the $\varepsilon^2$-heavy
hitters, $\|x_{[n] \setminus L}\|_{2p}^{2p}$ is maximized when there are $1/\varepsilon^2$ coordinates $i \in [n] \setminus L$ each with $|x_i|^p = \varepsilon^2 \|x\|_p^p$,
in which case $\|x_{[n] \setminus L}\|_{2p}^{2p} = \varepsilon^2 \|x\|_p^{2p}$.
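
The way the final estimate combines the per-bucket estimators can be sketched as follows. This is only an illustration under simplifying assumptions: `bucket_estimates[j]` stands in for the value of the $j$th bucket's estimator, `heavy_buckets` stands in for $h(L)$, and the function name `light_estimate` is hypothetical.

```python
def light_estimate(bucket_estimates, heavy_buckets):
    """Phi' = (R / |I|) * sum of bucket_estimates[j] over j in I, where I = [R] minus heavy_buckets."""
    R = len(bucket_estimates)
    kept = [j for j in range(R) if j not in heavy_buckets]
    if not kept:
        raise ValueError("every bucket contains a heavy hitter")
    # Rescale by R / |I| to compensate for the light mass that landed in discarded buckets.
    return R / len(kept) * sum(bucket_estimates[j] for j in kept)
```

The rescaling factor $R/|I|$ is what makes the expectation match the light mass up to a $(1 \pm O(\varepsilon))$ factor when light items are spread evenly over the buckets.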



2 The estimator given there was never actually named, so we name it LightEstimator here.

To achieve the desired update time, we buffer every $d = 1/\varepsilon^p$ updates then perform the fast
multipoint evaluation of Theorem 13 in batch (note this does not affect our space bound since
$p < 2$). That is, although the hash function $h$ can be evaluated in constant time, updating any $D_p^j$
requires evaluating a degree-$\Omega(1/\varepsilon^p)$ polynomial, which naively requires $\Omega(1/\varepsilon^p)$ time. Note that
one issue is that the different data structures $D_p^j$ use different polynomials, and thus we may need
to evaluate $1/\varepsilon^p$ different polynomials on the $1/\varepsilon^p$ points, defeating the purpose of batching. To
remedy this, note that these polynomials are themselves pairwise independent. That is, we can
assume there are two coefficient vectors $a, b$ of length $d + 1$, and the polynomial corresponding to $D_p^j$
is given by the coefficient vector $j \cdot a + b$. Thus, we only need to perform fast multipoint evaluation
on the two polynomials defined by $a$ and $b$. To achieve worst-case update time, this computation
can be spread over the next $d$ updates. If a query comes before $d$ updates are batched, we need to
perform $O(d \log^2 d \log\log d)$ work at once, but this is already dominated by our $O(1/\varepsilon^2)$ reporting
time since $p < 2$.
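
The linearity trick in this paragraph can be checked with a short Python sketch. It is a hedged illustration, not the paper's implementation: the prime `P` is an arbitrary stand-in for a field of size $\mathrm{poly}(nmM)$, plain Horner evaluation stands in for the fast multipoint evaluation of Theorem 13, and the names `poly_eval` and `batched_hashes` are hypothetical.

```python
P = 2_147_483_647  # a Mersenne prime standing in for a field of size poly(nmM)

def poly_eval(coeffs, x, p=P):
    """Horner evaluation of sum_k coeffs[k] * x^k over the integers mod p."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def batched_hashes(a, b, points, num_structures, p=P):
    """Values of the polynomial with coefficient vector j*a + b at every buffered point.

    Only the two polynomials given by a and b are evaluated at the points; the
    polynomial of structure j is then recovered by linearity as j*a(x) + b(x).
    """
    a_vals = [poly_eval(a, x, p) for x in points]  # stand-in for fast multipoint evaluation
    b_vals = [poly_eval(b, x, p) for x in points]
    return [[(j * av + bv) % p for av, bv in zip(a_vals, b_vals)]
            for j in range(1, num_structures + 1)]
```

Each buffered point costs two polynomial evaluations regardless of how many structures exist, which is why batching is not defeated by having one polynomial per structure.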

4 The final algorithm: putting it all together 

To obtain our final algorithm, one option is to run HighEnd and LightEstimator in parallel after
finding $L$, then output the sum of their estimates. Note that by the variance bound in Theorem 15,
the output of a single instantiation of LightEstimator is $\|x_{[n] \setminus L}\|_p^p \pm O(\varepsilon)\|x\|_p^p$ with large constant
probability. The downside to this option is that Theorem 1 uses space that would make our overall
$F_p$-estimation algorithm suboptimal by $\mathrm{polylog}(n/\varepsilon)$ factors, and HighEnd by an $O(\log(1/\varepsilon))$ factor
for $\alpha = \varepsilon^2$ (Theorem 11). We can overcome this by a combination of balancing and universe
reduction. Specifically, for balancing, notice that if instead of having $L$ be a list of $\varepsilon^2$-heavy hitters,
we instead defined it as a list of $\epsilon^2$-heavy hitters for some $\epsilon > \varepsilon$, we could improve the space of
both Theorem 1 and Theorem 11. To then make the variance in LightEstimator sufficiently small,
i.e. $O(\varepsilon^2 \|x\|_p^{2p})$, we could run $O((\epsilon/\varepsilon)^2)$ instantiations of LightEstimator in parallel and output the
average estimate, keeping the space optimal but increasing the update time to $\Omega((\epsilon/\varepsilon)^2)$. This
balancing gives a smooth tradeoff between space and update time; in fact note that for $\epsilon = 1$, our
overall algorithm simply becomes a derandomized variant of Li's geometric mean estimator. We
would like though to have $\epsilon \ll 1$ to have small update time.

Doing this balancing does not resolve all our issues though, since Theorem 1 is suboptimal
by a $\log n$ factor. That is, even if we picked $\epsilon = 1$, Theorem 1 would cause our overall space to
be $\Omega(\log(n) \log(mM))$, which is suboptimal. To overcome this issue we use universe reduction.
Specifically, we set $N = 1/\varepsilon^{18}$ and pick hash functions $h_1 : [n] \to [N]$ and $\sigma : [n] \to \{-1, 1\}$. We
define a new $N$-dimensional vector $y$ by $y_i = \sum_{h_1(j) = i} \sigma(j) x_j$. Henceforth in this section, $y$, $h_1$,
and $\sigma$ are as discussed here. Rather than computing a list $L$ of heavy hitters of $x$, we instead
compute a list $L'$ of heavy hitters of $y$. Then, since $y$ has length only $\mathrm{poly}(1/\varepsilon)$, Theorem 1 is
only suboptimal by $\mathrm{polylog}(1/\varepsilon)$ factors and our balancing trick applies. The list $L'$ is also used in
place of $L$ for both HighEnd and LightEstimator. Though, since we never learn $L$, we must modify
the algorithm LightEstimator described in Remark 16. Namely, the hash function $h : [n] \to [R]$ in
Remark 16 should be implemented as the composition of $h_1$, and a hash function $h_2 : [N] \to [R]$
chosen as in Theorem 14 (again with $z = R$ and $c = 2$). Then, we let $I = [R] \setminus h_2(L')$. The remaining
parts of the algorithm remain the same.
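
The universe reduction step, in which stream updates to $x$ are reinterpreted as updates to $y$, can be sketched in Python as follows. This is an illustration under loud assumptions: the seeded pseudorandom functions below merely stand in for the Theorem 14 family and for the boundedly independent signs, and the class name `UniverseReducer` is hypothetical.

```python
import random

class UniverseReducer:
    """Maintains y in Z^N, where y_i = sum of sigma(j) * x_j over all j with h1(j) = i."""

    def __init__(self, num_buckets, seed=0):
        self.num_buckets = num_buckets
        self.seed = seed
        self.y = [0] * num_buckets

    def h1(self, j):
        # Stand-in for the Theorem 14 family: a seeded pseudorandom bucket for index j.
        return random.Random(f"h1:{self.seed}:{j}").randrange(self.num_buckets)

    def sigma(self, j):
        # Stand-in for the boundedly independent sign function.
        return 1 if random.Random(f"sigma:{self.seed}:{j}").getrandbits(1) else -1

    def update(self, j, v):
        """Apply the stream update x_j <- x_j + v to the reduced vector y."""
        self.y[self.h1(j)] += self.sigma(j) * v
```

Since the map from $x$ to $y$ is linear, feeding each update $(j, v)$ through `update` yields the same $y$ as computing it from the final $x$, so heavy-hitter and moment sketches can be run on $y$ directly.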

There are several issues we must address to show that our universe reduction step still maintains
correctness. Informally, we need that (a) any $i$ which is a heavy hitter for $y$ should have exactly
one $j \in [n]$ with $h_1(j) = i$ such that $j$ was a heavy hitter for $x$, (b) if $i$ is a heavy hitter for $x$,
then $h_1(i)$ is a heavy hitter for $y$, and $|y_{h_1(i)}|^p = (1 \pm O(\varepsilon))|x_i|^p$ so that $x_i$'s contribution to $\|x\|_p^p$ is
properly approximated by HighEnd, (c) $\|y\|_p^p = O(\|x\|_p^p)$ with large probability, since the error term
in HighEnd is $O(\varepsilon \cdot \|y\|_p^p)$, and (d) the amount of $F_p$ mass not output by LightEstimator because
it collided with a heavy hitter for $x$ under $h_1$ is negligible. Also, the composition $h = h_2 \circ h_1$
for LightEstimator does not satisfy the conditions of Theorem 14 even though $h_1$ and $h_2$ might
do so individually. To see why, as a simple analogy, consider that the composition of two purely
random functions is no longer random. For example, as the number of compositions increases, the
probability of two items colliding increases as well. Nevertheless, the analysis of LightEstimator
carries over essentially unchanged in this setting, since whenever considering the distribution of
where two items land under $h$, we can first condition on them not colliding under $h_1$. Not colliding
under $h_1$ happens with $1 - O(\varepsilon^{18})$ probability, and thus the probability that two items land in two
particular buckets $j, j' \in [R]$ under $h$ is still $(1 \pm o(\varepsilon))/R^2$.

We now give our full description and analysis. We pick $h_1$ as in Theorem 14 with $z = R$ and
$c = c_h$ a sufficiently large constant. We also pick $\sigma$ from an $O(\log N)$-wise independent family. We
run an instantiation of $F_p$HH for the vector $y$ with $\phi = \epsilon^2/(34C)$ for a sufficiently large constant
$C > 0$. We also obtain a value $\tilde{F}_p \in [F_p/2, 3F_p/2]$ using the algorithm of [26]. We define $L'$ to be
the sublist of those $w$ output by our $F_p$HH instantiation such that $|\tilde{y}_w|^p \ge (2\epsilon^2/7)\tilde{F}_p$.

For ease of presentation in what follows, define $L_\phi$ to be the list of $\phi$-heavy hitters of $x$ with
respect to $F_p$ ("$L$", without a subscript, always denotes the $\varepsilon^2$-heavy hitters with respect to $x$),
and define $z_i = \sum_{w \in h_1^{-1}(i) \setminus L_{\varepsilon^8}} \sigma(w) x_w$, i.e. $z_i$ is the contribution to $y_i$ from the significantly light
elements of $x$.

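Lemma 17. For $x \in \mathbb{R}^n$, $\lambda > 0$ with $\lambda^2$ a multiple of 8, and random $z \in \{-1, 1\}^n$ drawn from a
$(\lambda^2/4)$-wise independent family, $\Pr[|\langle x, z \rangle| > \lambda \|x\|_2] < 2^{-\lambda^2/4}$.

Proof. Assume without loss of generality that $\|x\|_2 = 1$. By Markov's inequality on the random variable
$\langle x, z \rangle^{\lambda^2/4}$, $\Pr[|\langle x, z \rangle| > \lambda] < \lambda^{-\lambda^2/4} \cdot \mathbf{E}[\langle x, z \rangle^{\lambda^2/4}]$. The claim follows by applying Lemma 3. ■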

Lemma 18. For any $C > 0$, there exists $\varepsilon_0$ such that for $0 < \varepsilon < \varepsilon_0$, $\Pr[\|y\|_p^p > 17C\|x\|_p^p] < 2/C$.

Proof. Condition on $h_1$. Define $Y(i)$ to be the vector $x_{h_1^{-1}(i)}$. For any vector $v$ we have $\|v\|_2 \le \|v\|_p$
since $p < 2$. Letting $\mathcal{E}$ be the event that no $i \in [N]$ has $|y_i| > 4\sqrt{\log N}\,\|Y(i)\|_p$, we have $\Pr[\mathcal{E}] \ge 1 -
1/N^4$ by Lemma 17. For $i \in [N]$, again by Lemma 17 any $i \in [N]$ has $|y_i| \le 2 \cdot 2^t \cdot \|Y(i)\|_2 \le 2 \cdot 2^t \cdot \|Y(i)\|_p$
with probability at least $1 - \max\{1/(2N), 2^{-2^{2t}}\}$. Then for fixed $i \in [N]$,

$$
\begin{aligned}
\mathbf{E}[|y_i|^p \mid \mathcal{E}] &\le 2^p \|Y(i)\|_p^p + \sum_{t=0}^{\infty} \Pr\left[(2 \cdot 2^t)^p \|Y(i)\|_p^p < |y_i|^p \le (2 \cdot 2^{t+1})^p \|Y(i)\|_p^p \;\middle|\; \mathcal{E}\right] \cdot (2 \cdot 2^{t+1})^p \|Y(i)\|_p^p \\
&\le 2^p \|Y(i)\|_p^p + (1/\Pr[\mathcal{E}]) \cdot \sum_{t=0}^{\log(2\sqrt{\log N})} 2^{-2^{2t}} \cdot (2 \cdot 2^{t+1})^p \|Y(i)\|_p^p \\
&< 4\|Y(i)\|_p^p + (1/\Pr[\mathcal{E}]) \cdot \sum_{t=0}^{\log(2\sqrt{\log N})} 2^{-2^{2t}} \cdot (2 \cdot 2^{t+1})^2 \|Y(i)\|_p^p \\
&< 17\|Y(i)\|_p^p
\end{aligned}
$$

since $\Pr[\mathcal{E}] \ge 1 - 1/N^4$ and $\varepsilon_0$ is sufficiently small. Thus by linearity of expectation, $\mathbf{E}[\|y\|_p^p \mid \mathcal{E}] \le
17\|x\|_p^p$, which implies $\|y\|_p^p \le 17C\|x\|_p^p$ with probability $1 - 1/C$, conditioned on $\mathcal{E}$ holding. We
conclude by again using $\Pr[\mathcal{E}] \ge 1 - 1/N^4$. ■
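
The constant 17 in the proof of Lemma 18 can be sanity-checked numerically. The following Python sketch evaluates $2^p + \sum_{t \ge 0} 2^{-2^{2t}} (2 \cdot 2^{t+1})^p$ with the sum truncated after a few terms (the later terms are doubly exponentially small); the function name `lemma18_constant` and the truncation point are assumptions made for illustration.

```python
def lemma18_constant(p, t_max=4):
    """2^p + sum over t = 0..t_max of 2^(-2^(2t)) * (2 * 2^(t+1))^p."""
    tail = sum(2.0 ** (-(2 ** (2 * t))) * (2 * 2 ** (t + 1)) ** p
               for t in range(t_max + 1))
    return 2.0 ** p + tail
```

For $p = 2$ the terms are $4 + 8 + 4 + 2^{-8} + \cdots$, so the total is just above 16, leaving room for the $1/\Pr[\mathcal{E}]$ factor before reaching 17; smaller $p$ only decreases every term.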

Lemma 19. With probability at least $1 - \mathrm{poly}(\varepsilon)$ over $\sigma$, simultaneously for all $i \in [N]$ we have
that $|z_i| = O(\sqrt{\log(1/\varepsilon)} \cdot \varepsilon^{6/p} \|x\|_p)$.

Proof. By Lemma 17, any individual $i \in [N]$ has $|z_i| \le 4\sqrt{\log(1/\varepsilon)} \cdot (\sum_{w \in h_1^{-1}(i) \setminus L_{\varepsilon^8}} |x_w|^2)^{1/2}$ with
probability at least $1 - 1/N^4$. We then apply a union bound and use the fact that $\|v\|_2 \le \|v\|_p$ for $p < 2$,
so that $|z_i| \le 4\sqrt{\log(1/\varepsilon)} \cdot (\sum_{w \in h_1^{-1}(i) \setminus L_{\varepsilon^8}} |x_w|^p)^{1/p}$ (call this event $\mathcal{E}$) with probability $1 - \mathrm{poly}(\varepsilon)$.

We now prove our lemma, i.e. we show that with probability $1 - \mathrm{poly}(\varepsilon)$, $|z_i|^p = O(\log^{p/2}(1/\varepsilon) \cdot \varepsilon^6 \|x\|_p^p)$
simultaneously for all $i \in [N]$. We apply Lemma 2. Specifically, fix an $i \in [N]$. For all $j$ with
$|x_j|^p < \varepsilon^8 \|x\|_p^p$, let $X_j = |x_j|^p \cdot \mathbf{1}_{h_1(j) = i}$. Then, in the notation of Lemma 2, $\mu_j = |x_j|^p/N$, and
$\sigma_j^2 \le |x_j|^{2p}/N$, and thus $\mu = \|x\|_p^p/N$ and $\sigma^2 \le \|x\|_{2p}^{2p}/N \le \varepsilon^8 \|x\|_p^{2p}/N$. Also, $K = \varepsilon^8 \|x\|_p^p$. Then if
$h_1$ were $\ell$-wise independent for $\ell = 10$, Lemma 2 would give

$$\Pr\left[\left|\sum_j X_j - \|x\|_p^p/N\right| > \varepsilon^6 \|x\|_p^p\right] < 2^{O(\ell)} \cdot (\varepsilon^{7\ell} + \varepsilon^{2\ell}) = O(\varepsilon/N).$$

A union bound would then give that with probability $1 - \varepsilon$, the $F_p$ mass in any bucket from items $i$
with $|x_i|^p < \varepsilon^8 \|x\|_p^p$ is at most $\varepsilon^6 \|x\|_p^p$. Thus by a union bound with event $\mathcal{E}$, $|z_i|^p = O(\log^{p/2}(1/\varepsilon) \cdot \varepsilon^6 \|x\|_p^p)$
for all $i \in [N]$ with probability $1 - \mathrm{poly}(\varepsilon)$.

Though, $h_1$ is not 10-wise independent. Instead, it is selected as in Theorem 14. However, for
any constant $\ell$, by increasing the constant $c_h$ in our definition of $h_1$ we can ensure that our $\ell$th
moment bound for $(\sum_j X_j - \mu)$ is preserved to within a constant factor, which is sufficient to apply
Lemma 2. ■

Lemma 20. With probability $1 - \mathrm{poly}(\varepsilon)$, for all $w \in L$ we have $|y_{h_1(w)}|^p = (1 \pm O(\varepsilon))|x_w|^p$, and
thus with probability $1 - \mathrm{poly}(\varepsilon)$ when conditioned on $\|y\|_p^p \le 17C\|x\|_p^p$, we have that if $w$ is an
$\alpha$-heavy hitter for $x$, then $h_1(w)$ is an $\alpha/(34C)$-heavy hitter for $y$.

Proof. Let $w$ be in $L$. We know from Lemma 19 that $|z_{h_1(w)}| \le 2\sqrt{\log(1/\varepsilon)}\,\varepsilon^{6/p} \|x\|_p$ with
probability $1 - \mathrm{poly}(\varepsilon)$, and that the elements of $L$ are perfectly hashed under $h_1$ with probability
$1 - \mathrm{poly}(\varepsilon)$. Conditioned on this perfect hashing, we have that $|y_{h_1(w)}| \ge |x_w| - 2\varepsilon^{6/p}\sqrt{\log(1/\varepsilon)}\,\|x\|_p$.
Since for $w \in L$ we have $|x_w| \ge \varepsilon^{2/p} \|x\|_p$, and since $p < 2$, we have $|y_{h_1(w)}| \ge (1 - O(\varepsilon))|x_w|$.

For the second part of the lemma, $(1 - O(\varepsilon))|x_w| > |x_w|/2^{1/p}$ for $\varepsilon_0$ sufficiently small. Thus if
$w$ is an $\alpha$-heavy hitter for $x$, then $h_1(w)$ is an $\alpha/(34C)$-heavy hitter for $y$. ■

Finally, the following lemma follows from a Markov bound followed by a union bound.

Lemma 21. For $w \in [n]$ consider the quantity $s_w = \sum_{v \neq w,\, h_1(v) = h_1(w)} |x_v|^p$. Then, with probability at
least $1 - O(\varepsilon)$, $s_w \le \varepsilon^{15} \|x\|_p^p$ simultaneously for all $w \in L$.

We now put everything together. We set $\epsilon = \varepsilon \log(1/\varepsilon)$. As stated earlier, we define $L'$ to be
the sublist of those $w$ output by our $F_p$HH instantiation with $\phi = \epsilon^2/(34C)$ such that $|\tilde{y}_w|^p \ge (2\epsilon^2/7)\tilde{F}_p$.
We interpret updates to $x$ as updates to $y$ to then be fed into HighEnd, with $\alpha = \epsilon^2/(34C)$. Thus
both HighEnd and $F_p$HH require $O(\varepsilon^{-2} \log(nmM/\varepsilon))$ space. We now define some events.

Event $\mathcal{A}$. $L_{\varepsilon^8}$ is perfectly hashed under $h_1$, and $\forall i \in [N]$, $|z_i|^p = O(\log(1/\varepsilon)^{p/2} \cdot \varepsilon^6 \|x\|_p^p)$.

Event $\mathcal{B}$. $\forall w \in L_{\epsilon^2}$, $h_1(w)$ is output as an $\epsilon^2/(34C)$-heavy hitter by $F_p$HH.

Event $\mathcal{C}$. $\forall w \in L_{\epsilon^2/18}$, $|y_{h_1(w)}| = (1 \pm O(\varepsilon))|x_w|$.

Event $\mathcal{D}$. $\tilde{F}_p \in [(1/2) \cdot \|x\|_p^p, (3/2) \cdot \|x\|_p^p]$, and HighEnd, LightEstimator, and $F_p$HH succeed.

Now, suppose $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, and $\mathcal{D}$ all occur. Then for $w \in L_{\epsilon^2}$, $w$ is output by $F_p$HH, and furthermore
$|y_{h_1(w)}|^p \ge (1 - O(\varepsilon))|x_w|^p \ge |x_w|^p/2 \ge \epsilon^2 \|x\|_p^p/2$. Also, $\tilde{y}^p_{h_1(w)} \ge (6/7) \cdot |y_{h_1(w)}|^p$. Since $\tilde{F}_p \le
3\|x\|_p^p/2$, we have that $h_1(w) \in L'$. Furthermore, we also know that for $i$ output by $F_p$HH, $\tilde{y}_i^p \le
(9/7) \cdot |y_i|^p$, and thus $i \in L'$ implies $|y_i|^p \ge (\epsilon^2/9) \cdot \|x\|_p^p$. Notice that by event $\mathcal{A}$, each $y_i$ is $z_i$, plus
potentially $x_{w(i)}$ for some $x_{w(i)} \in L_{\varepsilon^8}$. If $|y_i|^p \ge (\epsilon^2/9) \cdot \|x\|_p^p$, then there must exist such a $w(i)$,
and furthermore it must be that $|x_{w(i)}|^p \ge (\epsilon^2/18) \cdot \|x\|_p^p$. Thus, overall, $L'$ contains $h_1(w)$ for all
$w \in L_{\epsilon^2}$, and furthermore if $i \in L'$ then $w(i) \in L_{\epsilon^2/18}$.

Since $L'$ contains $h_1(L_{\epsilon^2})$, LightEstimator outputs $\|x_{[n] \setminus h_1^{-1}(L')}\|_p^p \pm O(\varepsilon \|x\|_p^p)$. Also, HighEnd
outputs $\|y_{L'}\|_p^p \pm O(\varepsilon) \cdot \|y\|_p^p$. Now we analyze correctness. We have $\Pr[\mathcal{A}] = 1 - \mathrm{poly}(\varepsilon)$,
$\Pr[\mathcal{B} \mid \|y\|_p^p \le 17C\|x\|_p^p] = 1 - \mathrm{poly}(\varepsilon)$, $\Pr[\mathcal{C}] = 1 - \mathrm{poly}(\varepsilon)$, and $\Pr[\mathcal{D}] \ge 5/8$. We also have
$\Pr[\|y\|_p^p \le 17C\|x\|_p^p] \ge 1 - 2/C$. Thus by a union bound and setting $C$ sufficiently large, we have
$\Pr[\mathcal{A} \wedge \mathcal{B} \wedge \mathcal{C} \wedge \mathcal{D} \wedge (\|y\|_p^p \le 17C\|x\|_p^p)] \ge 9/16$. Define $L_{\mathrm{inv}}$ to be the set $\{w(i)\}_{i \in L'}$, i.e. the heavy hitters of $x$ corresponding
to the heavy hitters in $L'$ for $y$. Now, if all these events occur, then $\|x_{[n] \setminus h_1^{-1}(L')}\|_p^p = \|x_{[n] \setminus L_{\mathrm{inv}}}\|_p^p \pm
O(\varepsilon^{15})\|x\|_p^p$ with probability $1 - O(\varepsilon)$ by Lemma 21. We also have, since $\mathcal{C}$ occurs and conditioned
on $\|y\|_p^p = O(\|x\|_p^p)$, that $\|y_{L'}\|_p^p \pm O(\varepsilon) \cdot \|y\|_p^p = \|x_{L_{\mathrm{inv}}}\|_p^p \pm O(\varepsilon) \cdot \|x\|_p^p$. Thus, overall, our algorithm
outputs $\|x\|_p^p \pm O(\varepsilon) \cdot \|x\|_p^p$ with probability $17/32 > 1/2$ as desired. Notice this probability can be
amplified to $1 - \delta$ by outputting the median of $O(\log(1/\delta))$ independent instantiations.
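
The median amplification mentioned here can be sketched in Python. The sketch assumes an estimator that is accurate with some constant probability strictly above 1/2 and uses Hoeffding's inequality to pick the number of repetitions; the names `median_of_runs` and `repetitions_for` are hypothetical, and the constant in the repetition count is one valid choice rather than the paper's.

```python
import math
import statistics

def median_of_runs(estimator, repetitions):
    """Median of independent runs of a zero-argument randomized estimator."""
    return statistics.median(estimator() for _ in range(repetitions))

def repetitions_for(delta, success_prob):
    """Number of runs after which fewer than half succeed with probability at most delta.

    By Hoeffding's inequality this probability is at most exp(-2 * r * gap^2),
    where gap = success_prob - 1/2, so r = ln(1/delta) / (2 * gap^2) suffices.
    """
    gap = success_prob - 0.5
    return math.ceil(math.log(1 / delta) / (2 * gap * gap))
```

Whenever more than half of the runs land in the target interval, the median does too, which is why a per-run success probability of $17/32$ is enough; the repetition count scales as $O(\log(1/\delta))$ with a constant depending on the gap above $1/2$.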

We further note that for a single instantiation of LightEstimator, we have $\mathbf{E}_h[\mathrm{Var}_s[\Phi']] =
O(\epsilon^2 \|x\|_p^{2p})$. Once $h$ is fixed, the variance of $\Phi'$ is simply the sum of variances across the $D_j$
for $j \notin h_2(L')$. Thus, it suffices for the $D_j$ to use pairwise independent randomness. Furthermore,
in repeating $O((\epsilon/\varepsilon)^2)$ parallel repetitions of LightEstimator, it suffices that all the $D_j$ across all
parallel repetitions use pairwise independent randomness, and the hash function $h$ can remain the
same. Thus, as discussed in Remark 16, the coefficients of the degree-$O(1/\varepsilon^p)$ polynomials used
in all $D_j$ combined can be generated by just two coefficient vectors, and thus the update time of
LightEstimator with $O((\epsilon/\varepsilon)^2)$ parallel repetitions is just $O((\epsilon/\varepsilon)^2 + O(\log^2(1/\varepsilon) \log\log(1/\varepsilon))) =
O(\log^2(1/\varepsilon) \log\log(1/\varepsilon))$. Thus overall, we have the following theorem.

Theorem 22. There exists an algorithm such that given $0 < p < 2$ and $0 < \varepsilon < 1/2$, the algorithm
outputs $(1 \pm \varepsilon)\|x\|_p^p$ with probability 2/3 using $O(\varepsilon^{-2} \log(nmM/\varepsilon))$ space. The update time is
$O(\log^2(1/\varepsilon) \log\log(1/\varepsilon))$. The reporting time is $O(\varepsilon^{-2} \log^2(1/\varepsilon) \log\log(1/\varepsilon))$.

The space bound above can be assumed $O(\varepsilon^{-2} \log(mM) + \log\log n)$ by comments in Section 1.3.

A Appendix

A.l A heavy hitter algorithm for F p 

Note that $F_p$Report, $F_p$Update, and $F_p$Space below can be as in the statement in Section 2 by using
the algorithm of



Theorem 1 (restatement). There is an algorithm $F_p$HH satisfying the following properties.
Given $0 < \phi, \delta < 1/2$ and black-box access to an $F_p$-estimation algorithm $F_p\mathrm{Est}(\varepsilon', \delta')$ with $\varepsilon' = 1/7$
and $\delta' = \phi\delta/(12(\log(\phi n) + 1))$, $F_p$HH produces a list $L$ such that $L$ contains all $\phi$-heavy hitters
and does not contain indices which are not $\phi/2$-heavy hitters with probability at least $1 - \delta$. For
each $i \in L$, the algorithm also outputs $\mathrm{sign}(x_i)$, as well as an estimate $\tilde{x}_i$ of $x_i$ satisfying $\tilde{x}_i^p \in
[(6/7)|x_i|^p, (9/7)|x_i|^p]$. Its space usage is $O(\phi^{-1} \log(\phi n) \cdot F_p\mathrm{Space}(\varepsilon', \delta') + \phi^{-1} \log(1/(\delta\phi)) \log(nmM))$.
Its update time is $O(\log(\phi n) \cdot F_p\mathrm{Update}(\varepsilon', \delta') + \log(1/(\delta\phi)))$. Its reporting time is $O(\phi^{-1}(\log(\phi n) \cdot
F_p\mathrm{Report}(\varepsilon', \delta') + \log(1/(\delta\phi))))$. Here, $F_p\mathrm{Report}(\varepsilon', \delta')$, $F_p\mathrm{Update}(\varepsilon', \delta')$, and $F_p\mathrm{Space}(\varepsilon', \delta')$ are the
reporting time, update time, and space consumption of $F_p\mathrm{Est}$ when a $(1 \pm \varepsilon')$-approximation to $F_p$
is desired with probability at least $1 - \delta'$.

Proof. First we argue with $\delta' = \phi\delta/(12(\log n + 1))$. We assume without loss of generality that $n$ is
a power of 2. Consider the following data structure $\mathrm{Basic}F_p\mathrm{HH}(\phi', \delta, \varepsilon', k)$, where $k \in \{0, \ldots, \log n\}$.
We set $R = \lceil 1/\phi' \rceil$ and pick a function $h : \{0, \ldots, 2^k - 1\} \to [R]$ at random from a pairwise
independent hash family. We also create instantiations $D_1, \ldots, D_R$ of $F_p\mathrm{Est}(\varepsilon', 1/5)$. This entire
structure is then repeated independently in parallel $T = \Theta(\log(1/\delta))$ times, so that we have hash
functions $h_1, \ldots, h_T$, and instantiations $D_i^j$ of $F_p\mathrm{Est}$ for $(i, j) \in [R] \times [T]$. For an integer $x$ in $[n]$,
let $\mathrm{prefix}(x, k)$ denote the length-$k$ prefix of $x - 1$ when written in binary, treated as an integer in
$\{0, \ldots, 2^k - 1\}$. Upon receiving an update $(i, v)$ in the stream, we feed this update to $D^j_{h_j(\mathrm{prefix}(i, k))}$
for each $j \in [T]$.

For $t \in \{0, \ldots, 2^k - 1\}$, let $F_p(t)$ denote the $F_p$ value of the vector $x$ restricted to indices $i \in [n]$
with $\mathrm{prefix}(i, k) = t$. Consider the procedure $\mathrm{Query}(t)$ which outputs the median of $F_p$-estimates
given by $D^j_{h_j(t)}$ over all $j \in [T]$. We now argue that the output of $\mathrm{Query}(t)$ is in the interval
$[(1 - \varepsilon') \cdot F_p(t), (1 + \varepsilon') \cdot (F_p(t) + 5\phi' \|x\|_p^p)]$, i.e. $\mathrm{Query}(t)$ "succeeds", with probability at least $1 - \delta$.
For any $j \in [T]$, consider the actual $F_p$ value $F_p(t)^j$ of the vector $x$ restricted to coordinates $i$
such that $h_j(\mathrm{prefix}(i, k)) = h_j(t)$. Then $F_p(t)^j = F_p(t) + R(t)^j$, where $R(t)^j$ is the $F_p$ contribution
of the $i$ with $\mathrm{prefix}(i, k) \neq t$, yet $h_j(\mathrm{prefix}(i, k)) = h_j(t)$. We have $R(t)^j \ge 0$ always, and furthermore
$\mathbf{E}[R(t)^j] \le \|x\|_p^p/R$ by pairwise independence of $h_j$. Thus by Markov's inequality, $\Pr[R(t)^j >
5\phi' \|x\|_p^p] < 1/5$. Note for any fixed $j \in [T]$, the $F_p$-estimate output by $D^j_{h_j(t)}$ is in $[(1 - \varepsilon') \cdot
F_p(t), (1 + \varepsilon') \cdot (F_p(t) + 5\phi' \|x\|_p^p)]$ as long as both the events "$D^j_{h_j(t)}$ successfully gives a $(1 \pm \varepsilon')$-approximation"
and "$R(t)^j \le 5\phi' \|x\|_p^p$" occur. This happens with probability at least 3/5. Thus,
by a Chernoff bound, the output of $\mathrm{Query}(t)$ is in the desired interval with probability at least $1 - \delta$.
We now define the final $F_p$HH data structure. We maintain one global instantiation $D$ of
$F_p\mathrm{Est}(1/7, \delta/2)$. We also use the dyadic interval idea for $L_1$-heavy hitters given in [10]. Specifically,
we imagine building a binary tree $\mathcal{T}$ over the universe $[n]$ (without loss of generality assume $n$
is a power of 2). The number of levels in the tree is $\ell = 1 + \log n$, where the root is at level 0
and the leaves are at level $\log n$. For each level $j \in \{0, \ldots, \ell\}$, we maintain an instantiation $B_j$ of
$\mathrm{Basic}F_p\mathrm{HH}(\phi/80, \delta', 1/7, j)$ for $\delta'$ as in the theorem statement. When we receive an update $(i, v)$ in
the stream, we feed the update to $D$ and also to each $B_j$.

We now describe how to answer a query to output the desired list $L$. We first query $D$ to
obtain $\tilde{F}_p$, an approximation to $F_p$. We next initiate an iterative procedure on our binary tree,
beginning at the root, which proceeds level by level. The procedure is as follows. Initially, we
set $L = \{0\}$, $L' = \emptyset$, and $j = 0$. For each $i \in L$, we perform $\mathrm{Query}(i)$ on $B_j$ then add $2i$ and
$2i + 1$ to $L'$ if the output of $\mathrm{Query}(i)$ is at least $3\phi\tilde{F}_p/4$. After processing every $i \in L$, we then
set $L \leftarrow L'$ then $L' \leftarrow \emptyset$, and we increment $j$. This continues until $j = 1 + \log n$, at which point
we halt and return $L$. We now show why the list $L$ output by this procedure satisfies the claim
in the theorem statement. We condition on the event $\mathcal{E}$ that $\tilde{F}_p = (1 \pm 1/7)F_p$, and also on the
event $\mathcal{E}'$ that every query made throughout the recursive procedure is successful. Let $i$ be such
that $|x_i|^p \ge \phi F_p$. Then, since $F_p(\mathrm{prefix}(i, j)) \ge |x_i|^p$ for any $j$, we always have that $\mathrm{prefix}(i, j) \in L$
at the end of the $j$th round of our iterative procedure, since $(6/7)|x_i|^p \ge (3/4)\phi\tilde{F}_p$ given $\mathcal{E}$. Now,
consider an $i$ such that $|x_i|^p < (\phi/2)F_p$. Then, $(8/7) \cdot (|x_i|^p + 5 \cdot (\phi/80)F_p) < 3\phi\tilde{F}_p/4$, implying $i$ is
not included in the final output list. Also, note that since the query at the leaf corresponding to
$i \in L$ is successful, then by definition of a successful query, we are given an estimate $\tilde{x}_i^p$ of $|x_i|^p$ by
the corresponding $\mathrm{Basic}F_p\mathrm{HH}$ structure satisfying $\tilde{x}_i^p \in [(6/7)|x_i|^p, (8/7)|x_i|^p + (\phi/14)F_p]$, which is contained in
$[(6/7)|x_i|^p, (9/7)|x_i|^p]$ since $|x_i|^p \ge (\phi/2)F_p$.

We now only need to argue that $\mathcal{E}$ and $\mathcal{E}'$ occur simultaneously with large probability. We
have $\Pr[\mathcal{E}] \ge 1 - \delta/2$. For $\mathcal{E}'$, note there are at most $2/\phi$ $\phi/2$-heavy hitters at any level of the
tree, where at level $j$ we are referring to heavy hitters of the $2^j$-dimensional vector $y_j$ satisfying
$(y_j)_i^p = \sum_{\mathrm{prefix}(t, j) = i} |x_t|^p$. As long as the $\mathrm{Query}(\cdot)$ calls made for all $\phi/2$-heavy hitters and
their two children throughout the tree succeed (including at the root), $\mathcal{E}'$ holds. Thus, $\Pr[\mathcal{E}'] \ge
1 - \delta' \cdot 6(\log n + 1)\phi^{-1} = 1 - \delta/2$. Therefore, by a union bound $\Pr[\mathcal{E} \wedge \mathcal{E}'] \ge 1 - \delta$.

Finally, notice that the number of levels in $F_p$HH can be reduced from $\log n$ to $\log n - \log\lceil 1/\phi \rceil =
O(\log(\phi n))$ by simply ignoring the top $\log\lceil 1/\phi \rceil$ levels of the tree. Then, in the topmost level of
the tree which we maintain, the universe size is $O(1/\phi)$, so we can begin our reporting procedure
by querying all these universe items to determine which subtrees to recurse upon.
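
The level-by-level search over dyadic prefixes can be sketched in Python. This is a simplified illustration under stated assumptions: `query` returns the exact $F_p$ mass under a tree node, whereas the paper's BasicF_pHH structures only approximate it; indices are 0-based; and the function name `dyadic_heavy_hitters` is hypothetical.

```python
def dyadic_heavy_hitters(x, phi, p):
    """Level-by-level search over the binary tree on the index set, len(x) a power of 2.

    Returns the indices whose |x_i|^p passes the threshold (3/4) * phi * F_p,
    visiting only tree nodes whose parent passed the same threshold.
    """
    n = len(x)
    log_n = n.bit_length() - 1
    assert 1 << log_n == n, "n must be a power of 2"
    f_p = sum(abs(v) ** p for v in x)
    threshold = 0.75 * phi * f_p

    def query(level, t):
        width = n >> level  # number of leaves under node t at this level
        return sum(abs(v) ** p for v in x[t * width:(t + 1) * width])

    frontier = [0]  # the root, i.e. the empty prefix
    for level in range(log_n):
        nxt = []
        for t in frontier:
            if query(level, t) >= threshold:
                nxt.extend((2 * t, 2 * t + 1))
        frontier = nxt
    # frontier now holds candidate leaves; keep those that pass the test themselves
    return [t for t in frontier if query(log_n, t) >= threshold]
```

Because the mass under a node is at least the mass of any single leaf below it, a heavy index is never pruned, and at most $O(1/\phi)$ nodes survive per level, which is what keeps the reporting time near-linear in $1/\phi$.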

To recover $\mathrm{sign}(x_w)$ for each $w \in L$, we use the CountSketch data structure of [8] with $T =
(21 \cdot 2^p)/\phi$ columns and $C = \Theta(\log(1/(\delta\phi)))$ rows; the space is $O(\phi^{-1} \log(1/(\delta\phi)) \log(nmM))$,
and the update time is $O(\log(1/(\delta\phi)))$. CountSketch operates by, for each row $i$, having a pairwise
independent hash function $h_i : [n] \to [T]$ and a 4-wise independent hash function $\sigma_i : [n] \to \{-1, 1\}$.
There are $C \cdot T$ counters $A_{i,j}$ for $(i, j) \in [C] \times [T]$. Counter $A_{i,j}$ maintains $\sum_{h_i(v) = j} \sigma_i(v) \cdot x_v$. For
$(i, j) \in [C] \times [T]$, let $x^i$ be the vector $x$ restricted to coordinates $v$ with $h_i(v) = h_i(w)$, other than
$w$ itself. Then for fixed $i$, the expected contribution to $\|x^i\|_p^p$ is at most $\|x\|_p^p/T$, and thus is at
most $10\|x\|_p^p/T$ with probability 9/10 by Markov's inequality. Conditioned on this event, $|x_w| >
2\|x^i\|_p \ge 2\|x^i\|_2$. The analysis of CountSketch also guarantees $|A_{i,h_i(w)} - \sigma_i(w) x_w| \le 2\|x^i\|_2$ with
probability at least 2/3, and thus by a union bound, $|x_w| > |A_{i,h_i(w)} - \sigma_i(w) x_w|$ with probability
at least 11/20, in which case $\sigma_i(w) \cdot \mathrm{sign}(A_{i,h_i(w)}) = \mathrm{sign}(x_w)$. Thus, by a Chernoff bound over
all rows, together with a union bound over all $w \in L$, we can recover $\mathrm{sign}(x_w)$ for all $w \in L$ with
probability $1 - \delta$. ■
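
The sign-recovery use of CountSketch can be sketched in Python. This is an illustration only: the random polynomials over a prime field are an assumed way to realize (approximately) pairwise and 4-wise independent hash functions, the majority vote over rows plays the role of the Chernoff argument, and the names `SignSketch` and `_poly_hash` are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime; field for the polynomial hash functions

def _poly_hash(coeffs, v):
    """Horner evaluation of a random polynomial at v over the integers mod PRIME."""
    acc = 0
    for c in coeffs:
        acc = (acc * v + c) % PRIME
    return acc

class SignSketch:
    """CountSketch-style table used only to recover sign(x_w) for heavy w."""

    def __init__(self, rows, cols, seed=0):
        rng = random.Random(seed)
        self.cols = cols
        # Degree-1 polynomials: pairwise independent buckets (up to rounding mod cols).
        self.bucket_coeffs = [[rng.randrange(PRIME) for _ in range(2)] for _ in range(rows)]
        # Degree-3 polynomials: 4-wise independent signs (low bit; very slightly biased).
        self.sign_coeffs = [[rng.randrange(PRIME) for _ in range(4)] for _ in range(rows)]
        self.table = [[0] * cols for _ in range(rows)]

    def _bucket(self, i, v):
        return _poly_hash(self.bucket_coeffs[i], v) % self.cols

    def _sign(self, i, v):
        return 1 if _poly_hash(self.sign_coeffs[i], v) & 1 else -1

    def update(self, v, delta):
        """Process the stream update x_v <- x_v + delta."""
        for i, row in enumerate(self.table):
            row[self._bucket(i, v)] += self._sign(i, v) * delta

    def sign_of(self, w):
        """Majority vote over rows of sigma_i(w) * sign(A[i][h_i(w)])."""
        votes = sum(self._sign(i, w) * (1 if row[self._bucket(i, w)] >= 0 else -1)
                    for i, row in enumerate(self.table))
        return 1 if votes >= 0 else -1
```

A row votes correctly whenever the colliding mass in the bucket of $w$ is smaller than $|x_w|$, so a heavy coordinate wins the vote as long as this holds in a majority of rows.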

A.2 Proof of Theorem 12

In this section we prove Theorem 12. The data structure and estimator we give is a slightly modified
version of the geometric mean estimator of Li [28]. Our modification allows us to show that only
bounded independence is required amongst the $p$-stable random variables in our data structure.

Before giving our $D_p$ and $\mathrm{Est}_p$, we first define the $p$-stable distribution.

Definition 23 (Zolotarev [13]). For $0 < p < 2$, there exists a probability distribution $\mathcal{D}_p$ called
the $p$-stable distribution satisfying the following property. For any positive integer $n$ and vector
$x \in \mathbb{R}^n$, if $Z_1, \ldots, Z_n \sim \mathcal{D}_p$ are independent, then $\sum_{j=1}^n Z_j x_j \sim \|x\|_p Z$ for $Z \sim \mathcal{D}_p$.

Li's geometric mean estimator is as follows. For some positive integer $t > 2$, select a matrix
$A \in \mathbb{R}^{t \times n}$ with independent $p$-stable entries, and maintain $y = Ax$ in the stream. Given $y$, the
estimate of $\|x\|_p^p$ is then $C_{t,p} \cdot (\prod_{j=1}^t |y_j|^{p/t})$ for some constant $C_{t,p}$. For Theorem 12, we make
the following adjustments. First, we require $t > 4$. Next, for any fixed row of $A$ we only require
that the entries be $\Theta(1/\varepsilon^p)$-wise independent, though the rows themselves we keep independent.
Furthermore, in parallel we run the algorithm of [26] with constant error parameter to obtain a
value $\tilde{F}_p$ in $[\|x\|_p^p/2, 3\|x\|_p^p/2]$. The $D_p$ data structure of Theorem 12 is then simply $y$, together with
the state maintained by the algorithm of [26]. The estimator $\mathrm{Est}_p$ is $\min\{C_{t,p} \cdot (\prod_{j=1}^t |y_j|^{p/t}), \tilde{F}_p/\varepsilon\}$.
To state the value $C_{t,p}$, we use the following theorem.

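Theorem 24 ([13, Theorem 2.6.3]). For $Q \sim \mathcal{D}_p$ and $-1 < \lambda < p$,

$$\mathbf{E}[|Q|^\lambda] = \frac{2}{\pi}\,\Gamma\!\left(1 - \frac{\lambda}{p}\right)\Gamma(\lambda)\sin\!\left(\frac{\pi}{2}\lambda\right).$$

Theorem 24 implies that we should set

$$C_{t,p} = \left[\frac{2}{\pi}\,\Gamma\!\left(1 - \frac{1}{t}\right)\Gamma\!\left(\frac{p}{t}\right)\sin\!\left(\frac{\pi p}{2t}\right)\right]^{-t},$$

since the rows of $A$ are independent and each $|y_j|^{p/t}$ has expectation $\|x\|_p^{p/t} \cdot \mathbf{E}[|Q|^{p/t}]$.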
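
As a hedged illustration of the geometric mean estimator, the following Python sketch handles the case $p = 1$, where the 1-stable distribution is the standard Cauchy distribution. It uses fully independent entries rather than the bounded independence analyzed in this section, takes the normalizing constant to be $\mathbf{E}[|Q|^{p/t}]^{-t}$ computed from the moment formula of Theorem 24, and the function names are hypothetical.

```python
import math
import random

def gm_constant(t, p):
    """C_{t,p} = E[|Q|^(p/t)]^(-t), with the moment taken from the formula of Theorem 24."""
    lam = p / t
    moment = ((2 / math.pi) * math.gamma(1 - lam / p) * math.gamma(lam)
              * math.sin(math.pi * lam / 2))
    return moment ** (-t)

def cauchy_sketch(x, t, rng):
    """y = Ax for a t-by-n matrix A of independent standard Cauchy (1-stable) entries."""
    return [sum(math.tan(math.pi * (rng.random() - 0.5)) * xi for xi in x)
            for _ in range(t)]

def gm_estimate(y, p):
    """Geometric mean estimate of ||x||_p^p from y = Ax with p-stable entries in A."""
    t = len(y)
    prod = 1.0
    for v in y:
        prod *= abs(v) ** (p / t)
    return gm_constant(t, p) * prod
```

For $p = 1$ the constant simplifies to $\cos^t(\pi/(2t))$, which gives a quick consistency check on the moment formula; averaging `gm_estimate` over many independent sketches should approach $\|x\|_1$.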
To carry out our analysis, we will need the following theorem, which gives a way of producing
a smooth approximation of the indicator function of an interval while maintaining good bounds on
high order derivatives.

Theorem 25 ([E]). For any interval $[a, b] \subseteq \mathbb{R}$ and integer $c > 0$, there exists a nonnegative
function $I^c_{[a,b]} : \mathbb{R} \to \mathbb{R}$ satisfying the following properties:

i. $\|(I^c_{[a,b]})^{(\ell)}\|_\infty \le (2c)^\ell$ for all $\ell \ge 0$.

ii. For any $x \in \mathbb{R}$, $|I^c_{[a,b]}(x) - I_{[a,b]}(x)| \le \min\{1, 5/(2c^2 \cdot d(x, \{a, b\})^2)\}$.

We also need the following lemma of [26], which argues that smooth, bounded functions have
their expectations approximately preserved when their input is a linear form evaluated at boundedly
independent $p$-stable random variables, as opposed to completely independent $p$-stable random
variables.

Lemma 26 ([26, Lemma 2.2]). There exists an $\varepsilon_0 > 0$ such that the following holds. Let $0 < \varepsilon < \varepsilon_0$
and $0 < p < 2$ be given. Let $f : \mathbb{R} \to \mathbb{R}$ satisfy $\|f^{(\ell)}\|_\infty = O(\alpha^\ell)$ for all $\ell \ge 0$, for some $\alpha$ satisfying
$\alpha^p \ge \log(1/\varepsilon)$. Let $k = \alpha^p$. Let $x \in \mathbb{R}^n$ satisfy $\|x\|_p = O(1)$. Let $R_1, \ldots, R_n$ be drawn from a
$3Ck$-wise independent family of $p$-stable random variables for $C > 0$ a sufficiently large constant,
let $R = \sum_i R_i x_i$, and let $Q$ be the product of $\|x\|_p$ and a $p$-stable random variable. Then $|\mathbf{E}[f(R)] - \mathbf{E}[f(Q)]| = O(\varepsilon)$.

We now prove a tail bound for linear forms over $k$-wise independent $p$-stable random variables.
Note that for a random variable $X$ whose moments are bounded, one has $\Pr[X - \mathbf{E}[X] > t] \le
\mathbf{E}[(X - \mathbf{E}[X])^k]/t^k$ by applying Markov's inequality to the random variable $(X - \mathbf{E}[X])^k$ for some
even integer $k \ge 2$. Unfortunately, for $0 < p < 2$, it is known that even the second moment of
$\mathcal{D}_p$ is already infinite, so this method cannot be applied. We instead prove our tail bound via
FT-mollification of $I_{[t,\infty)}$, since $\Pr[X \ge t] = \mathbf{E}[I_{[t,\infty)}(X)]$.

We will need to refer to the following lemma.

Lemma 27 (Nolan [33, Theorem 1.12]). For fixed $0 < p < 2$, the probability density function
$\varphi_p$ of the $p$-stable distribution satisfies $\varphi_p(x) = O(1/(1 + |x|^{p+1}))$ and is an even function. The
cumulative distribution function satisfies $\Phi_p(x) = O(|x|^{-p})$.

We now prove our tail bound. 

Lemma 28. Suppose $x \in \mathbb{R}^n$, $\|x\|_p = 1$, $0 < \varepsilon < 1$ is given, and $R_1, \ldots, R_n$ are $k$-wise independent
$p$-stable random variables for $k \ge 2$. Let $Q \sim \mathcal{D}_p$. Then for all $t \ge 0$, $R = \sum_{i=1}^n R_i x_i$ satisfies

$$|\Pr[|Q| \ge t] - \Pr[|R| \ge t]| = O(k^{-1/p}/(1 + t^{p+1}) + k^{-2/p}/(1 + t^2) + 2^{-\Omega(k)}).$$

Proof. We have Pr[|Z| > t] = E[/[ t)00 )(Z)] + E[/(_ 00)t ](Z)] for any random variable Z, and thus 
we will argue |E[J [ti0o) (Q)]-E[J [ti0o) (k)]| = 0(r 1 / p /(l + f +1 ) + ^" 2/p /(l + f 2 ) + 2- SJ W); a similar 
argument shows the same bound for ^[//^^(Q)] — E[/(_ 00)t ](i2)]|. 

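We argue the following chain of inequalities for $c = k^{1/p}/(3C)$, for $C$ the constant in Lemma 26,
and we define $\gamma = k^{-1/p}/(1 + t^{p+1}) + k^{-2/p}/(1 + t^2)$:

$$\mathbf{E}[I_{[t,\infty)}(Q)] \approx_{\gamma} \mathbf{E}[\tilde{I}^c_{[t,\infty)}(Q)] \approx_{2^{-c^p}} \mathbf{E}[\tilde{I}^c_{[t,\infty)}(R)] \approx_{\gamma + 2^{-c^p}} \mathbf{E}[I_{[t,\infty)}(R)].$$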

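$\mathbf{E}[I_{[t,\infty)}(Q)] \approx_{\gamma} \mathbf{E}[\tilde{I}^c_{[t,\infty)}(Q)]$: Assume $t \ge 1$. We have

$$\begin{aligned}
|\mathbf{E}[I_{[t,\infty)}(Q)] - \mathbf{E}[\tilde{I}^c_{[t,\infty)}(Q)]| &\le \mathbf{E}[|I_{[t,\infty)}(Q) - \tilde{I}^c_{[t,\infty)}(Q)|] \\
&\le \Pr[|Q - t| \le 1/c] + \left(\sum_{s=1}^{\log(ct)-1} \Pr[|Q - t| \le 2^s/c] \cdot O(2^{-2s})\right) \qquad (4) \\
&\qquad + \Pr[|Q - t| > t/2] \cdot O(c^{-2}t^{-2}) \\
&= O(1/(c \cdot t^{p+1})) + O(c^{-2}t^{-2})
\end{aligned}$$

since $\Pr[|Q - t| \le 2^s/c]$ is $O(2^s/(c \cdot t^{p+1}))$ as long as $2^s/c \le t/2$.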
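
The final equality can be unpacked as follows. This is a sketch that uses only the density bound of Lemma 27 and the stated condition $2^s/c \le t/2$; the interval-length argument in the first display is our reading of the step rather than text from the source.

```latex
% For t >= 1 and w <= t/2, the interval [t - w, t + w] lies inside [t/2, 3t/2],
% where Lemma 27 bounds the density by O(1/t^{p+1}). The interval has length 2w, so
\[
  \Pr[|Q - t| \le w] = O\!\left(\frac{w}{t^{p+1}}\right).
\]
% Taking w = 2^s / c in each dyadic term of Eq. (4) gives a geometric series:
\[
  \sum_{s=1}^{\log(ct)-1} O\!\left(\frac{2^s}{c\,t^{p+1}}\right) \cdot O\!\left(2^{-2s}\right)
  = O\!\left(\frac{1}{c\,t^{p+1}}\right) \sum_{s \ge 1} 2^{-s}
  = O\!\left(\frac{1}{c\,t^{p+1}}\right).
\]
% With c = k^{1/p}/(3C) and t >= 1, so that 1 + t^{p+1} <= 2 t^{p+1} and 1 + t^2 <= 2 t^2:
\[
  O\!\left(\frac{1}{c\,t^{p+1}}\right) + O\!\left(\frac{1}{c^{2} t^{2}}\right)
  = O\!\left(\frac{k^{-1/p}}{1 + t^{p+1}} + \frac{k^{-2/p}}{1 + t^{2}}\right)
  = O(\gamma).
\]
```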

In the case $0 < t < 1$, we repeat the same argument as above but replace Eq. (4) with a
summation from $s = 1$ to $\infty$, and also remove the additive $\Pr[|Q - t| > t/2] \cdot O(c^{-2}t^{-2})$ term.
Doing so gives an overall upper bound of $O(1/c)$ in this case.

$\mathbf{E}[\tilde{I}^c_{[t,\infty)}(Q)] \approx_{2^{-c^p}} \mathbf{E}[\tilde{I}^c_{[t,\infty)}(R)]$: This follows from Lemma 26 with $\varepsilon = 2^{-c^p}$ and $\alpha = c$.

$\mathbf{E}[\tilde{I}^c_{[t,\infty)}(R)] \approx_{\gamma + 2^{-c^p}} \mathbf{E}[I_{[t,\infty)}(R)]$: We would like to apply the same argument as when showing
$\mathbf{E}[\tilde{I}^c_{[t,\infty)}(Q)] \approx_{\gamma} \mathbf{E}[I_{[t,\infty)}(Q)]$ above. The trouble is, we must bound $\Pr[|R - t| > t/2]$ and $\Pr[|R - t| \le
2^s/c]$ given that the $R_i$ are only $k$-wise independent. For the first probability, we above only used
that $\Pr[|Q - t| > t/2] \le 1$, which still holds with $Q$ replaced by $R$.
For the second probability, observe $\Pr[|R - t| \le 2^s/c] = \mathbf{E}[I_{[t-2^s/c,\,t+2^s/c]}(R)]$. Define $\delta =
2^s/c + b/c$ for a sufficiently large constant $b > 0$ to be determined later. Then, arguing as above, we
have $\mathbf{E}[\tilde{I}^c_{[t-\delta,t+\delta]}(R)] \approx_{2^{-c^p}} \mathbf{E}[\tilde{I}^c_{[t-\delta,t+\delta]}(Q)] \approx_{\gamma} \mathbf{E}[I_{[t-\delta,t+\delta]}(Q)]$, and we also know $\mathbf{E}[I_{[t-\delta,t+\delta]}(Q)] =
O(\mathbf{E}[I_{[t-2^s/c,\,t+2^s/c]}(Q)]) = O(\Pr[|Q - t| \le 2^s/c]) = O(2^s/(c \cdot t^{p+1}))$. Now, for $x \notin [t - 2^s/c, t + 2^s/c]$,
$I_{[t-2^s/c,\,t+2^s/c]}(x) = 0$ while $\tilde{I}^c_{[t-\delta,t+\delta]}(x) \ge 0$. For $x \in [t - 2^s/c, t + 2^s/c]$, the distance from $x$ to
$\{t - \delta, t + \delta\}$ is at least $b/c$, implying $\tilde{I}^c_{[t-\delta,t+\delta]}(x) \ge 1/2$ for $b$ sufficiently large by item (ii) of
Lemma 25. Thus, $2 \cdot \tilde{I}^c_{[t-\delta,t+\delta]} \ge I_{[t-2^s/c,\,t+2^s/c]}$ on $\mathbb{R}$, and thus in particular, $\mathbf{E}[I_{[t-2^s/c,\,t+2^s/c]}(R)] \le
2 \cdot \mathbf{E}[\tilde{I}^c_{[t-\delta,t+\delta]}(R)]$. Thus, in summary, $\mathbf{E}[I_{[t-2^s/c,\,t+2^s/c]}(R)] = O(2^s/(c \cdot t^{p+1}) + \gamma + 2^{-c^p})$. ■

We now prove the main lemma of this section, which implies Theorem 12.

Lemma 29. Let $x \in \mathbb{R}^n$ be such that $\|x\|_p = 1$, and suppose $0 < \varepsilon < 1/2$. Let $0 < p < 2$,
and let $t$ be any constant greater than $4/p$. Let $R_1, \ldots, R_n$ be $k$-wise independent $p$-stable random
variables for $k = \Omega(1/\varepsilon^p)$, and let $Q$ be a $p$-stable random variable. Define $f(x) = \min\{|x|^{1/t}, T\}$,
for $T = 1/\varepsilon$. Then, $|\mathbf{E}[f(R)] - \mathbf{E}[|Q|^{1/t}]| = O(\varepsilon)$ and $\mathbf{E}[f^2(R)] = O(\mathbf{E}[|Q|^{2/t}])$.

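Proof. We first argue $|\mathbf{E}[f(R)] - \mathbf{E}[|Q|^{1/t}]| = O(\varepsilon)$. We argue through the chain of inequalities

$$\mathbf{E}[|Q|^{1/t}] \approx_{\varepsilon} \mathbf{E}[f(Q)] \approx_{\varepsilon} \mathbf{E}[f(R)].$$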

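$\mathbf{E}[|Q|^{1/t}] \approx_{\varepsilon} \mathbf{E}[f(Q)]$: We have

$$\begin{aligned}
|\mathbf{E}[|Q|^{1/t}] - \mathbf{E}[f(Q)]| &= 2\int_{T^t}^{\infty} (x^{1/t} - T) \cdot \varphi_p(x)\,dx \\
&= \int_{T^t}^{\infty} (x^{1/t} - T) \cdot O(1/x^{p+1})\,dx \\
&= O\left(T^{1-tp} \cdot \left(\frac{t}{pt - 1} + \frac{1}{p}\right)\right) \\
&= O(1/(Tp)) \\
&= O(\varepsilon)
\end{aligned}$$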

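$\mathbf{E}[f(Q)] \approx_{\varepsilon} \mathbf{E}[f(R)]$: Let $\varphi_p^+$ be the probability density function corresponding to the distribution
of $|Q|$, and let $\Phi_p^+$ be its cumulative distribution function. Then, by integration by parts and
Lemma 28,

$$\begin{aligned}
\mathbf{E}[f(Q)] &= \int_0^{T^t} x^{1/t} \varphi_p^+(x)\,dx + T \cdot \int_{T^t}^{\infty} \varphi_p^+(x)\,dx \\
&= -\left[x^{1/t} \cdot (1 - \Phi_p^+(x))\right]_0^{T^t} - T \cdot \left[(1 - \Phi_p^+(x))\right]_{T^t}^{\infty} + \frac{1}{t}\int_0^{T^t} x^{1/t-1}(1 - \Phi_p^+(x))\,dx \\
&= \frac{1}{t}\int_0^{T^t} x^{1/t-1} \cdot \Pr[|Q| \ge x]\,dx \\
&= \frac{1}{t}\int_0^{T^t} x^{1/t-1} \cdot \left(\Pr[|R| \ge x] + O\left(k^{-1/p}/(1 + x^{p+1}) + k^{-2/p}/(1 + x^2) + 2^{-\Omega(k)}\right)\right)dx \\
&= \mathbf{E}[f(R)] + \int_0^1 x^{1/t-1} \cdot O\left(k^{-1/p} + k^{-2/p} + 2^{-\Omega(k)}\right)dx \\
&\qquad + \int_1^{T^t} x^{1/t-1} \cdot O\left(k^{-1/p}/x^{p+1} + k^{-2/p}/x^2 + 2^{-\Omega(k)}\right)dx \qquad (5) \\
&= \mathbf{E}[f(R)] + O(\varepsilon) + O\left(k^{-1/p} \cdot \frac{T^{1-t(p+1)} - 1}{\frac{1}{t} - p - 1} + k^{-2/p} \cdot \frac{T^{1-2t} - 1}{\frac{1}{t} - 2} + 2^{-\Omega(k)} \cdot T\right) \\
&= \mathbf{E}[f(R)] + O(\varepsilon)
\end{aligned}$$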
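
The integration-by-parts identity can be checked numerically in the one case where everything has a closed form. The sketch below is an illustration, not part of the proof: it assumes $p = 1$ (standard Cauchy, where $\Pr[|Q| \ge y] = 1 - (2/\pi)\arctan y$ and $\mathbf{E}[|Q|^{s}] = 1/\cos(\pi s/2)$ for $|s| < 1$), and it substitutes $x = u^t$ so that $\frac{1}{t}\int_0^{T^t} x^{1/t-1}\Pr[|Q| \ge x]\,dx$ becomes $\int_0^T \Pr[|Q| \ge u^t]\,du$. The parameter values and function names are illustrative.

```python
import math


def cauchy_abs_tail(y):
    """Pr[|Q| >= y] for a standard Cauchy Q and y >= 0."""
    return 1.0 - (2.0 / math.pi) * math.atan(y)


def expected_f_via_tail(t, T, steps=200_000):
    """Midpoint-rule value of the integral of Pr[|Q| >= u^t] over u in [0, T].

    After the substitution x = u^t this equals
    (1/t) * integral_0^{T^t} x^(1/t - 1) * Pr[|Q| >= x] dx,
    i.e. E[min(|Q|^(1/t), T)].
    """
    h = T / steps
    return h * sum(cauchy_abs_tail(((i + 0.5) * h) ** t) for i in range(steps))


def exact_fractional_moment(t):
    """E[|Q|^(1/t)] for standard Cauchy Q, valid for t > 1."""
    return 1.0 / math.cos(math.pi / (2.0 * t))


if __name__ == "__main__":
    t, eps = 5.0, 0.1          # t > 4/p with p = 1
    T = 1.0 / eps
    print("truncated mean E[f(Q)]:  ", expected_f_via_tail(t, T))
    print("untruncated E[|Q|^(1/t)]:", exact_fractional_moment(t))
```

With these values the two numbers differ by far less than $\varepsilon$, in line with the first link of the chain.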
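We show $\mathbf{E}[f^2(R)] = O(\mathbf{E}[|Q|^{2/t}])$ similarly. Namely, we argue through the chain of inequalities

$$\mathbf{E}[|Q|^{2/t}] \approx_{\varepsilon} \mathbf{E}[f^2(Q)] \approx_{\varepsilon} \mathbf{E}[f^2(R)],$$

which proves our claim since $\mathbf{E}[|Q|^{2/t}] = \Omega(1)$ by Theorem 24.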

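$\mathbf{E}[|Q|^{2/t}] \approx_{\varepsilon} \mathbf{E}[f^2(Q)]$: We have

$$\begin{aligned}
|\mathbf{E}[|Q|^{2/t}] - \mathbf{E}[f^2(Q)]| &= 2\int_{T^t}^{\infty} (x^{2/t} - T^2) \cdot \varphi_p(x)\,dx \\
&= \int_{T^t}^{\infty} (x^{2/t} - T^2) \cdot O(1/x^{p+1})\,dx \\
&= O\left(T^{2-tp} \cdot \left(\frac{t}{pt - 2} + \frac{1}{p}\right)\right) \\
&= O(1/(Tp)) \\
&= O(\varepsilon)
\end{aligned}$$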

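$\mathbf{E}[f^2(Q)] \approx_{\varepsilon} \mathbf{E}[f^2(R)]$: This is argued nearly identically as in our proof that $\mathbf{E}[f(Q)] \approx_{\varepsilon} \mathbf{E}[f(R)]$
above. The difference is that our error term now corresponding to Eq. (5) is

$$\begin{aligned}
&\int_0^1 x^{2/t-1} \cdot O\left(k^{-1/p} + k^{-2/p} + 2^{-\Omega(k)}\right)dx + \int_1^{T^t} x^{2/t-1} \cdot O\left(k^{-1/p}/x^{p+1} + k^{-2/p}/x^2 + 2^{-\Omega(k)}\right)dx \\
&\qquad = O(\varepsilon) + O\left(k^{-1/p} \cdot \frac{T^{2-t(p+1)} - 1}{\frac{2}{t} - p - 1} + k^{-2/p} \cdot \frac{T^{2-2t} - 1}{\frac{2}{t} - 2} + 2^{-\Omega(k)} \cdot T^2\right) \\
&\qquad = O(\varepsilon)
\end{aligned}$$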
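
To see the statement of Lemma 29 in action, the following Monte Carlo sketch estimates $\mathbf{E}[f(R)]$ for $R = \sum_i R_i x_i$. It rests on several assumptions that are not from the source: the $R_i$ are drawn fully independently (rather than from a $k$-wise independent family), sampling uses the Chambers-Mallows-Stuck formula for symmetric stable laws, and the vector `x`, the seed, and all function names are illustrative. For $p = 1$ the result can be compared with the closed form $\mathbf{E}[|Q|^{1/t}] = 1/\cos(\pi/(2t))$.

```python
import math
import random


def sample_p_stable(p, rng):
    """One draw from a symmetric p-stable law (Chambers-Mallows-Stuck formula).

    For p = 1 the expression reduces to tan(theta), a standard Cauchy draw.
    """
    theta = rng.uniform(-math.pi / 2.0, math.pi / 2.0)
    w = rng.expovariate(1.0)
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
            * (math.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))


def f(value, t, T):
    """The truncated function f(x) = min(|x|^(1/t), T) from Lemma 29."""
    return min(abs(value) ** (1.0 / t), T)


def estimate_mean_f(x, p, t, eps, trials, seed=0):
    """Monte Carlo estimate of E[f(R)] with R = sum_i R_i * x_i.

    Assumes ||x||_p = 1 and fully independent R_i (a simplification).
    """
    rng = random.Random(seed)
    T = 1.0 / eps
    total = 0.0
    for _ in range(trials):
        r = sum(sample_p_stable(p, rng) * xi for xi in x)
        total += f(r, t, T)
    return total / trials


if __name__ == "__main__":
    x = [0.5, 0.3, 0.2]            # ||x||_1 = 1
    p, t, eps = 1.0, 5.0, 0.1      # t > 4/p
    print("Monte Carlo E[f(R)]:  ", estimate_mean_f(x, p, t, eps, 20000))
    print("closed-form E[|Q|^(1/t)]:", 1.0 / math.cos(math.pi / (2.0 * t)))
```

Because $|x|^{1/t}$ has finite variance for $t > 2/p$, the sample mean concentrates even though $R$ itself is heavy-tailed, which is the reason for taking the $1/t$ power before averaging.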